[1] "Hello World!"
Programming for FinTech
Module 3: R
R Programming Basics
Print function
You can print a value to the console with the print() function.
At the console, print() is called implicitly when an expression is evaluated without it.
Basic Calculations
You can use R to do basic math calculations.
Question: What is 500 * (100 / 2.5) + 770?
Arithmetic Operators
- Addition (+), Subtraction (-)
- Multiplication (*), Division (/)
- Exponentiation (^ or **)
- Modulo (%%): Returns the remainder of a division
- Integer Division (%/%): Returns the integer quotient
Logical Test Operators
- Less than (<) or greater than (>)
- Less than or equal (<=) or greater than or equal (>=)
- Equality (==) and Inequality (!=)
- Logical NOT (!)
- Element-wise AND (&), OR (|)
Exercise: Check
- 3 >= 5
- TRUE == FALSE
- TRUE & FALSE
- (3 > 1) & (3 <= 5)
- TRUE | FALSE
- !TRUE == FALSE
- !(3 > 1) & (3 <= 5)
Scalar
A scalar is the simplest data type in R: it represents a single element, not a collection.
Atomic Vector
To combine multiple elements into a vector, use the c() function.
[1] 1 2
[1] 2 3 4 6 7
[1] 3
In R, “scalar” is actually represented as a vector of length 1.
Integer sequence
The colon operator : generates an integer sequence, written as from:to.
Length of a vector
To check the number of elements of a vector, use length() function:
Vectorized operations
Vectorized operations are vector-in, vector-out, as opposed to scalar operations (one-in, one-out).
In R, most operations are designed with vectors in mind.
For example, + and * are vectorized operators.
Question:
What do you get when you run c(2,3) * c(3,9)?
What’s the length of the output?
Recycling rule
When the operands are of different lengths, the shorter one is recycled as many times as necessary.
When it cannot be recycled entirely, it still works but raises a warning message:
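A minimal sketch of vectorized arithmetic and the recycling rule. The vectors and object names below are illustrative only, not taken from the course material.

```r
# Equal lengths: operands are paired element by element
same_length <- c(1, 2, 3) + c(10, 20, 30)        # 11 22 33

# The length-2 operand is recycled twice to match length 4
recycled <- c(1, 2) + c(10, 20, 30, 40)          # 11 22 31 42

# 5 is not a multiple of 3: R still recycles, but raises a warning.
# suppressWarnings() is used here only to keep the sketch quiet.
partial <- suppressWarnings(c(1, 2, 3) + c(10, 20, 30, 40, 50))  # 11 22 33 41 52
```

In each case the output is as long as the longer operand.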
Why vectorized?
A vectorized operation is much, much faster than iterating (looping) over each single (scalar) element.
Avoid using loops and utilize vectorized operation whenever possible.
For computers, using vectorization or not is like a difference in our mental calculation between
- 3 * 9 and
- (3 + 3 + 3 + 3 + 3 + 3 + 3 + 3 + 3).
Exercises
What is wrong with c(1 + 1, 2 + 1, 3 + 1)? How can you make it better?
Answer how c(1,2) * 1:3 works.
REPL vs Scripting
REPL: Interactive programming
So far, we’ve done interactive programming, REPL:
- Read, Evaluate, Print, Loop
- For rapid prototyping, exploration, debugging, etc.
Continuation Prompt
On the console, > means: “Waiting for your command”
+ means: “Continue command”
CTRL + C to abort.
Script (Batch)
Alternatively, you can run a whole script outside of the R environment, using Rscript.
Running a script file on current R session:
Writing a complete script is your final goal in programming.
- For production and deployment
REPL for development, and script for production.
To run the (whole) script, Ctrl (Cmd) + Shift + Return
Binding Names (Symbols)
Use <- to bind a name (symbol) to an object.
Here, my_number is called symbol, or name of an object.
Style guide: although = also works for assignment, use <-.
Reserve = for specifying function arguments.
Some IDEs (e.g., RStudio, Positron, VS Code) have Alt (Option) + - as a shortcut.
R has strict rules about a syntactic name (symbol).
- It is case sensitive
- It cannot contain whitespace
- It cannot start with numbers
Error in parse(text = input): <text>:2:2: unexpected input
1: my_number_1 <- 15
2: 1_
^
You can’t use reserved words like TRUE, NULL, if, etc.
If you deliberately want to use a non-syntactic name, wrap it in backticks `
Object naming conventions
Since symbols cannot contain whitespace, there are several naming conventions:
- snake_case
- camelCase
- PascalCase
It is better to make a short, self-explanatory name.
e.g. weight <- 15 is easier to understand than my_variable_quantity <- 15
Interactive prompt
readline() generates a prompt for interactive input.
The response can be a value and assigned as an object.
In class Exercise
- Create a vector object of numbers: 98,99,100,101,102
- Assign the above to a symbol bond_prices.
- If you follow the CamelCase naming convention, what would it be?
- Generate an interactive prompt that asks for the interest rate, and assign the response to the symbol interest_rate
Evaluating vs Assigning
Consider the following code. What is the printed value of a?
The expression a * 2 is evaluated but not assigned to any variable, so a remains unchanged.
To store the result, you need to assign it:
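A minimal sketch of the difference. The starting value 10 and the helper name a_unchanged are assumptions made for illustration.

```r
a <- 10
a * 2               # evaluated and printed (20), but the result is discarded
a_unchanged <- a    # a is still 10 at this point

a <- a * 2          # assign the result back to the symbol
a                   # now 20
```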
R Data Types
Object types in R
Vector type: Common data types
Special type (non-vector): non-vectors
- Functions, Environments, etc.
Vectors are the most important family of data types in R.
Vector type
Vector is a data structure that stores multiple elements. It comes in two flavors:
- Atomic vector: all elements same type
- Generic vector: known as list, can have different types of elements
NULL is not a vector, but it often serves as a zero-length vector.
Atomic vectors
There are four primary types of atomic vector in R, and two others.
Type of Atomic vectors
- Logical (or Boolean): TRUE, FALSE, NA
- Integer: integer numbers
Append L to a number to treat it as a strict integer (e.g., 1L).
- Double: real numbers
numeric is a collective term for both doubles and integers, but in practice it is often used as if it were a synonym for “double” or “real number”.
- Character (or string): words, wrapped by " or '
[1] "character"
[1] "character"
Style guide: Use double quote " for character instead of ' if possible.
- Two other types:
raw type: binary data type
complex type: complex numbers (e.g. 3 + 4i)
rarely needed in Finance
Missing values: NA
- Missing values are denoted by NA (Not Available): similar to “undefined”
- They are not identical to zero or NULL
- NULL is an intentionally empty “placeholder” in R
NA is a logical vector of length 1.
NULL is a special type (NULL) of length 0.
NaN (Not a Number) is a numeric value of length 1.
- It marks unrepresentable numeric results (e.g., 0/0, log(-1)).
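The distinctions above can be checked directly in the console. This is a small sketch; the prices vector is a made-up example.

```r
typeof(NA)                 # "logical"
length(NA)                 # 1
typeof(NULL)               # "NULL"
length(NULL)               # 0

nan_value <- 0 / 0         # NaN
typeof(nan_value)          # "double"
is.na(nan_value)           # TRUE: NaN also counts as missing
is.nan(NA)                 # FALSE: NA is not NaN

# Missing values propagate unless explicitly removed
prices <- c(100, NA, 102)
mean(prices)               # NA
clean_mean <- mean(prices, na.rm = TRUE)   # 101
```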
Exercise
What are four primary types of atomic vector?
What are the types of a, b, c, d below?
Confirm your answer with typeof().
List
A list is a generic vector, as opposed to an atomic vector.
An atomic vector can hold only one type of element (Double, Integer, Logical, …)
A list can hold elements of different types (even another list)
[[1]]
[1] 1
[[2]]
[1] 3.5
[[3]]
[1] "Hi"
[[4]]
[1] TRUE
[1] "list"
List can have atomic vectors as its elements, with varying lengths:
[1] "list"
Question: What is the length of my_list?
R Object Attributes
Attributes
Attributes are metadata that is attached to R objects, providing additional information or functionality.
Common attributes:
Names: Labels for elements in a vector or list
Dimensions (dim): Used for matrices, arrays
Class: Defines how an object should be treated by functions
etc.
Names attribute
Elements of vector (atomic, generic) can be named.
There are roughly three ways to assign names attribute.
Method 1: names()
In R, frequently used attributes have their own accessor functions named after them, such as names(), class(), dim().
$names
[1] "AAPL" "GOOG" "MSFT" "AMZN"
AAPL GOOG MSFT AMZN
150 200 250 300
Method 2: attr()
Or use attr() function to set attribute:
$names
[1] "AAPL" "GOOG" "MSFT" "AMZN"
Method 3: by construct
Or assign names by construct:
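The three methods can be sketched side by side. The tickers and prices follow the output shown earlier; the object names by_names, by_attr, and by_construct are illustrative.

```r
stock_prices <- c(150, 200, 250, 300)
tickers <- c("AAPL", "GOOG", "MSFT", "AMZN")

# Method 1: the names() accessor
by_names <- stock_prices
names(by_names) <- tickers

# Method 2: the general attr() setter
by_attr <- stock_prices
attr(by_attr, "names") <- tickers

# Method 3: name the elements when the vector is constructed
by_construct <- c(AAPL = 150, GOOG = 200, MSFT = 250, AMZN = 300)

# All three produce the same named vector
identical(by_names, by_attr)        # TRUE
identical(by_names, by_construct)   # TRUE
```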
Exercise
Give a names attribute to the stock_prices vector using the aforementioned three methods.
- names(obj) <-
- attr(obj, "names") <-
- by construct
Dim attribute
Adding a dim attribute to a vector allows it to behave like a 2-dimensional matrix or a multi-dimensional array.
[,1] [,2] [,3]
[1,] 1 3 5
[2,] 2 4 6
Class attribute
The class attribute in R defines how functions behave on an object.
Especially important classes in Finance are:
- Date, Time
- Factors
- Dataframe
- or your own custom-built class
Example 1: Date/Time
Very important class in Finance.
They are built on double-type atomic vectors (type), but have their own rules for use (class).
[1] "2026-02-08"
[1] "2026-02-08 15:12:57 EST"
Check their data type:
Check their attributes: they have class attributes.
To directly access the class attribute:
Class attribute and change of behavior
For an example, see how it works with + function.
[1] "2026-02-09"
[1] "2026-02-08 15:12:58 EST"
Q: Why does + 1 yield different results?
A: Because they are in different classes. + 1 is interpreted differently.
The class attribute gives context for how an object should behave with functions.
Deep down, they are just numbers:
- Date: The value of double represents the number of days since “1970-01-01” (Unix Epoch)
- Time: the number of seconds since Unix Epoch
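A short sketch of how the class changes what + 1 means. The dates are chosen near the Unix Epoch so the underlying numbers are easy to read; the object names are illustrative.

```r
trade_date <- as.Date("1970-01-10")
typeof(trade_date)        # "double"
class(trade_date)         # "Date"
unclass(trade_date)       # 9: days since 1970-01-01
next_day <- trade_date + 1            # one DAY later: "1970-01-11"

trade_time <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
as.numeric(trade_time)    # 60: seconds since the Unix Epoch
next_second <- trade_time + 1         # one SECOND later
```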
Time zone attribute
Time has “tzone” attribute (time-zone) that controls “formatting” of date-time.
- The lower-level data (double) for the time is not changing.
[1] "2026-02-08 15:12:57 EST"
[1] 1770581578
[1] "2026-02-08 20:12:57 UTC"
[1] 1770581578
attr(,"tzone")
[1] "UTC"
Example 2: Factors
Factors (or Categorical) can only have a set of predefined values.
- It is built on top of integer type
[1] "integer"
[1] "ordered" "factor"
[1] "integer"
[1] "factor"
If they were stripped of all attributes:
Base types and Class
Example 3: Dataframe & tibble
A class built on top of list type, with 2D tabular representations
- Similar to matrix (2D form)
- Dataframe and tibble can have different types of columns
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
[1] "list"
[1] "data.frame"
If the class attribute is removed, it turns back into a list
Let’s browse the attributes of iris dataframe:
$names
[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
$row.names
[1] 1 2 3 4 5 6
$class
[1] "data.frame"
A tibble is a more robust dataframe class.
- It has better printing output than data.frame
- Convert class with as_tibble()
# A tibble: 3 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Implicit Class
These “building-block” types in R have an implicit class:
- Atomic vectors: (primary data types - Logical, Integer, Double, Character)
- Generic vector (list)
- Matrix / Array
- etc.
Their class is not shown by attributes(), but is still returned when explicitly asked for with class()
Exercise
- Execute typeof(c(1,2,3)) and typeof(c(1L, 2L, 3L)). What’s the difference?
- Assign a vector with three elements: 1,3,5, and name it as my_first_object
- Assign another object with one element: 5, name it as MySecondObject
- Multiply my_first_object with MySecondObject. What do you get?
- Assign a vector with your name: my_name
- What do you get when you execute my_name + 3? (Expect error)
Importance of class
Since class determines the behavior of an object, it is crucial to know the class of your data, especially when performing function calls.
Calling a function means executing/applying a function.
For example, you cannot use the addition function on a character and a numeric.
[1] 7
Error in 3 + "Hi": non-numeric argument to binary operator
Other example: c()
Some functions coerce the class / type instead.
[1] "character"
[1] "character"
c() does coercion when inputs are in different types, but not all functions do coercion.
Exercise: coercion
Let’s see how coercion works with c().
What is the type of test?
What is the class of test?
Coercion precedence
In general, coercion follows a fixed order, with lighter types converted to heavier ones:
Logical (light) -> Integer -> Double -> Character (heavy)
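The precedence can be verified with typeof(). This is a small sketch with made-up inputs.

```r
typeof(c(TRUE, 1L))          # "integer":   logical -> integer
typeof(c(1L, 2.5))           # "double":    integer -> double
typeof(c(2.5, "a"))          # "character": double  -> character
typeof(c(TRUE, 2.5, "a"))    # "character": the heaviest type wins

# Coercion also happens in arithmetic: TRUE counts as 1, FALSE as 0
n_true <- sum(c(TRUE, FALSE, TRUE))   # 2
```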
Quick Check
What would be the type of:
Data type and memory footprint
Given the same length, logical and integer vectors take the least memory, then double, then character.
R Access Operations
Vector Indexing
Single square brackets [ and ]
- Subset multiple elements from a vector
Double brackets [[ and ]]
- Subset a single element (scalar) from a vector
Vector indexing: Single brackets
Single square brackets [ select multiple elements of a vector
- by index: x[1] retrieves the first element
- by name: x["Bob"] retrieves the element named “Bob”
[1] 150
MSFT GOOGL
205 250
Elements can be accessed with index:
- Positive integer: select
- Negative integer: exclude
- You can’t mix positive and negative index
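A minimal sketch of these indexing rules, assuming a named price vector like the one in the output above.

```r
stock_prices <- c(AAPL = 150, MSFT = 205, GOOGL = 250, AMZN = 303)

first <- stock_prices[1]                     # by position, keeps the name
by_name <- stock_prices[c("MSFT", "GOOGL")]  # by name
without_first <- stock_prices[-1]            # negative index excludes
single <- stock_prices[[2]]                  # [[ returns one bare element: 205

# Mixing signs is an error:
# stock_prices[c(1, -2)]
```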
Vector indexing: Double brackets
Use double brackets [[ on vector when you want to select single element (Scalar).
Style guide: while single bracket on vector still works, use [[ on vectors to reinforce your expectation.
Subset & Assignment on Vector
Subsetting vector can be combined with assignment <- to modify selected values.
AAPL MSFT GOOGL AMZN
150 205 250 303
AAPL MSFT GOOGL AMZN
235 205 250 303
Assigning multiple elements:
AAPL MSFT GOOGL AMZN
235 265 250 265
Exercise
- Generate stock price vector:
Access the 1st and 3rd elements of stock_prices
Assign 300 to “MSFT” and “AAPL”.
You can subset vector with logical vector inside brackets:
AAPL MSFT GOOGL AMZN
FALSE FALSE TRUE FALSE
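As a sketch of logical subsetting (the threshold of 240 and the object names are made up for illustration): a comparison produces a logical vector, which then selects the TRUE positions.

```r
stock_prices <- c(AAPL = 150, MSFT = 205, GOOGL = 250, AMZN = 303)

is_expensive <- stock_prices > 240        # FALSE FALSE TRUE TRUE
expensive <- stock_prices[is_expensive]   # GOOGL and AMZN only

# Logical subsetting also combines with assignment
capped <- stock_prices
capped[capped > 240] <- 240               # cap the selected prices
```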
Exercise
Generate below stock prices:
Subset and get the following:
- Stocks whose prices are greater than $250
- Stocks whose prices are even
- Modify prices greater than $600 to $360
List Indexing
There are 3 ways to index lists, each with its own merits:
- Single square brackets []: returns original (list) type
- Double square brackets [[]]: returns element’s type
- Dollar sign operator $
A list is like a train carrying multiple cars:
List Indexing: Single brackets
Single bracket returns a list object, train.
List Indexing: Double brackets
Double bracket returns element’s type, car.
List Indexing: $ Operator
$ is a shorthand operator for double bracket [[ with a variable name.
- Access variable without quotes (")
- Autocompletion friendly
Portfolio example
Construct a list of portfolio:
Subset by number index:
Subset by name:
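A sketch of the three list-indexing operators on a small portfolio. The contents are assumptions modeled on the output shown in the next section, not the original slide code.

```r
portfolio <- list(
  stocks = c(AAPL = 150, TSLA = 630),
  cash   = 5000
)

train <- portfolio[1]            # single bracket: still a list (the train)
car   <- portfolio[[1]]          # double bracket: the element itself (the car)

cash_list  <- portfolio["cash"]  # list of length 1
cash_value <- portfolio[["cash"]]  # numeric 5000
portfolio$cash                   # shorthand for portfolio[["cash"]]
```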
Assignment on Lists
You can use chained bracket operation and assignment:
$stocks
AAPL TSLA
150 630
$bonds
TBOND TBOND5Y
1000 210
$cash
[1] 5000
$brokerage
[1] "Robinhood"
You can remove a component of list by assigning NULL
Element removal on atomic vectors
Assigning NULL to an element of a vector doesn’t work:
Error in prices[[1]] <- NULL: replacement has length zero
Should use negative indexing & overwrite in this case:
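The contrast between lists and atomic vectors can be sketched as follows, with illustrative values.

```r
# Lists: assigning NULL removes the component
portfolio <- list(cash = 5000, brokerage = "Robinhood")
portfolio$brokerage <- NULL
names(portfolio)                 # "cash"

# Atomic vectors: exclude with a negative index, then overwrite
prices <- c(150, 205, 250)
prices <- prices[-1]             # 205 250
```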
Exercise
Create a list object portfolio as:
Subset the portfolio to:
- Retrieve the stocks vector using index
- Retrieve the cash vector using its name
- Return a list of stocks and cash using their names
- Return a list containing only bonds
- Remove brokerage element from portfolio
R Functions
Functions
Everything that exists is an object.
Everything that happens is a function call. (John Chambers)
Function calls
Function calls in R come in four varieties.
- prefix: function name comes before its arguments
- infix: function name comes in between arguments
- replacement: functions that replace values by assignment
- special: built-in R syntax like [[, for, and if, which does not have a consistent structure.
Rewriting to prefix form
An interesting property of R is that every infix, replacement, and special form can be rewritten in prefix form.
- use backtick ` to wrap the function symbol
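A few rewrites, sketched with made-up values. Each pair of lines does the same thing.

```r
# infix -> prefix
1 + 2
infix_as_prefix <- `+`(1, 2)               # 3

# special ([[) -> prefix
x <- c(10, 20, 30)
x[[2]]
special_as_prefix <- `[[`(x, 2)            # 20

# replacement -> prefix (returns the modified copy)
names(x) <- c("a", "b", "c")
x <- `names<-`(x, c("a", "b", "c"))

# special (if) -> prefix
if_as_prefix <- `if`(TRUE, "yes", "no")    # "yes"
```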
Our first prefix function call, c(), concatenates all the values and generates a single object.
c() takes an arbitrary number of arguments (...)
The seq() function generates a sequence of numbers.
- It has three main arguments:
from, to, and by.
- If the user doesn’t specify the argument names, inputs are matched in order.
- If the user supplies more arguments than the function accepts, it raises an error.
Some functions have pre-defined (default) argument values:
Defining a function
You can define custom function (User-defined function) in R with the following syntax:
The function can be called in prefix form:
Default argument value can be assigned by construction:
Q: What would happen if user calls above function with
my_first_function()
Curly Braces
By default, R evaluates each line as an individual statement.
Using curly braces { } allows you to group multiple expressions into a single unit that executes together.
Exercise
Prep: Load stringr package with library(stringr)
Write your own function say_hello() that takes no argument. It prints “Hi!” when called.
Now tweak the function to accept an argument, name. It prints “Hello, {name}!”
- Use str_glue("Hello, {name}!")
Returning Values in Functions
return() can serve as an early exit from a function: the remaining code is not executed.
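A minimal sketch of an early exit. The function safe_sqrt() is a hypothetical helper written for this illustration.

```r
safe_sqrt <- function(x) {
  if (x < 0) {
    return(NA_real_)   # early exit: nothing below this line runs
  }
  sqrt(x)              # the last evaluated value is returned implicitly
}

safe_sqrt(16)   # 4
safe_sqrt(-1)   # NA
```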
Function Fundamentals
A function has three parts:
- The formals(): list of arguments
- The body(): code inside the function
- The environment(): where you defined the function
  - “GlobalEnv”: top-level workspace in the R session
Check formals(), body() and environment():
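For instance, with a small hypothetical function add_fee() (not part of the course code):

```r
add_fee <- function(price, fee = 1) {
  price + fee
}

formals(add_fee)             # the argument list: price, fee = 1
arg_names <- names(formals(add_fee))
body(add_fee)                # the code inside the braces
environment(add_fee)         # <environment: R_GlobalEnv> when defined at the top level

# Functions from packages live in that package's namespace
environment(sd)              # <environment: namespace:stats>
```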
Some functions are found from external packages:
Getting Help on Functions
All R functions are built by someone, and documentation is typically provided.
For detailed description of any function, use ? followed by the function’s name.
For example, try below code in your console:
Or, use help()
Build a Perpetuity Calculator
The present value of a perpetuity, where the cash flow grows at a constant rate g, is given by:
\[ PV_{PER} = \frac{PMT}{r - g} \]
where
- PMT is the payment or cash flow.
- r is the discount rate.
- g is the growth rate of the cash flow.
This formula applies when r > g.
Defining a function
You can design your perpetuity function in R with following syntax:
Calling a function
Let’s call the function above:
- What is the PV of perpetuity, when PMT = $10,000, r = 7% and g = 3%?
- Assign the result value of the function
- Vector can be the input (vectorized)
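One possible way to write and call such a function, sketched directly from the formula PV = PMT / (r - g). The argument names follow the symbols defined above.

```r
# Present value of a growing perpetuity (valid when r > g)
pv_per <- function(PMT, r, g) {
  PMT / (r - g)
}

# PMT = $10,000, r = 7%, g = 3%
pv_per(10000, 0.07, 0.03)                       # 250000

# Assign the result to a symbol
pv <- pv_per(PMT = 10000, r = 0.07, g = 0.03)

# Vectorized: several growth rates at once
pv_by_growth <- pv_per(10000, 0.07, c(0.01, 0.02, 0.03))
```

Because / and - are vectorized, the function accepts vector inputs without any loop.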
Exercise
Define a perpetuity calculator function, pv_per().
What is the pv when PMT = $50,000, r = 4%, g = 0%?
What is the pv when PMT = $50,000, r = 4%, but g is 1%, 2%, 3%?
Default Arguments
What happens if user doesn’t specify one argument?
Error in pv_per(10000, 0.07): argument "g" is missing, with no default
You can set default values for arguments, allowing them to be omitted when calling the function.
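A sketch of a default argument, assuming g = 0 (no growth) is a sensible default for this calculator:

```r
# g defaults to 0 when the caller omits it
pv_per <- function(PMT, r, g = 0) {
  PMT / (r - g)
}

no_growth   <- pv_per(10000, 0.07)            # uses g = 0
with_growth <- pv_per(10000, 0.07, g = 0.03)  # default overridden
```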
Example 2: Black-Scholes Pricing
Functions can do more complex calculations. Following the Black-Scholes call / put pricing formula, we can write a function as below:
# S0: spot price, K: strike, r: risk-free rate,
# T: time to maturity in years, sigma: volatility
bsm_price <- function(S0, K, r, T, sigma, type = "call") {
  # d1 and d2 terms of the Black-Scholes formula
  d1 <- (log(S0 / K) + (r + 0.5 * sigma^2) * T) / (sigma * sqrt(T))
  d2 <- d1 - sigma * sqrt(T)
  if (type == "call") {
    # European call price
    return(S0 * pnorm(d1) - K * exp(-r * T) * pnorm(d2))
  } else if (type == "put") {
    # European put price
    return(K * exp(-r * T) * pnorm(-d2) - S0 * pnorm(-d1))
  } else {
    stop("Invalid option type. Use 'call' or 'put'.")
  }
}
Calculate price estimates with four scenarios:
Anonymous function
Functions are typically named so they can be reused multiple times.
However, you can skip naming a custom function; such functions are called anonymous functions.
- Useful when the function is simple and called only once.
They are not stored as objects, since no symbol (name) is bound to them.
Syntactic sugar: Function
Syntactic sugar refers to a feature in programming that makes the code simple to read or write, without adding functionality.
(Anonymous) functions can be defined with syntactic sugar (concise expression):
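A brief sketch of both spellings. The backslash shorthand requires R 4.1 or later; the squaring function is just an illustration.

```r
# Full form, defined and called on the spot
full_form <- (function(x) x^2)(4)      # 16

# Shorthand: \(x) is sugar for function(x)
short_form <- (\(x) x^2)(4)            # 16

# Typical use: pass a throwaway function to another function
scaled <- sapply(1:3, \(x) x * 10)     # 10 20 30
```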
Exercise
Convert below perpetuity function (pv_per) to anonymous function:
Syntactic sugar: Pipe Operator
A Motivating example
Solve below math problem. Describe your steps. What was the first and the last step?
\[ \sqrt{(2+4)^2 - 3 * 4} = ? \]
- Do 2+4 and then square it, and save it in your memory
- Do 3*4 and then subtract it from the previous result, and update your memory
- and then take the square root of the value
Similarly, code is often not written in the order in which we calculate.
It is easier for us to read and write code in the order in which it is executed.
When we have composite function calls such as
The call sequence is x -> k() -> h() -> g() -> f().
It is easier to read, write, and debug if we can write the code like:
Pipe operator & Function Chain
This is where pipe operator |> becomes handy in R.
The pipe operator does “and then” job, and it can be written as:
Style guide: use |> instead of %>%. Use the shortcut Cmd (Ctrl) + Shift + M.
Sometimes you’ll see the %>% operator instead, which comes from an external package (magrittr), whereas |> is native to R (version 4.1 or later). To use %>%, the package must be loaded with library(magrittr).
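A small sketch comparing a nested call with the equivalent pipe chain, using a made-up vector:

```r
x <- c(4, 9, 16)

# Nested: read from the inside out
nested <- round(sum(sqrt(x)), 1)

# Piped: read left to right, "and then"
piped <- x |> sqrt() |> sum() |> round(1)

identical(nested, piped)   # TRUE
```

The pipe passes the left-hand result as the first argument of the next call, so x |> round(1) is round(x, 1).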
Exercise (challenge!)
Solve \(\sqrt{2^3}\) using pipe operator.
- First, solve the above in a procedural way
  - For square root, use the sqrt() function
- Next, solve using the pipe operator.
  - Define a function named cube that does x^3
  - Code should start with 2.
External packages
Packages are add-on libraries that extend the functionality of R.
- They provide additional functions, datasets, and tools for various tasks
- Can be easily installed with install.packages()
- And loaded in an R session with library()
Installing packages:
Load packages: you need to load a package to use its functionality.
- Need to load only once per session
Control Structure
Control Structure?
Control structure dictates which code gets executed and when.
- Conditional Statements:
  - if statements: Execute code if a condition is true.
  - else / else if statements: Execute code if the condition is false.
- Loops:
  - for loops: Repeat code block a specified number of times.
  - while loops: Continue executing code as long as a condition is true.
- Map (apply):
  - Map a function to each element of a collection without explicitly writing loops.
If-else
The basic form of if and if-else statement in R:
[1] 1 2
Example 1: if and else executes code based on logical conditions.
[1] "The stock price has increased significantly!"
Example 2: If condition is not met, then nothing happens (skipped).
Example 3: else if checks one more logic condition:
[1] "The stock price has increased moderately."
Example 4: There can be multiple else if
[1] "The stock price has increased slightly."
Example 5: else is executed when all of if conditions are not met.
[1] "The stock price has decreased."
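The chain of conditions in Examples 1 to 5 might be written as below. The thresholds (10%, 5%, 0%) and the wrapper function describe_change() are assumptions for illustration; only the messages come from the outputs above.

```r
describe_change <- function(price_change) {
  if (price_change > 0.10) {
    "The stock price has increased significantly!"
  } else if (price_change > 0.05) {
    "The stock price has increased moderately."
  } else if (price_change > 0) {
    "The stock price has increased slightly."
  } else {
    # reached only when every condition above is FALSE
    "The stock price has decreased."
  }
}

describe_change(0.07)    # "The stock price has increased moderately."
describe_change(-0.02)   # "The stock price has decreased."
```

Conditions are tested from top to bottom, and only the first matching branch runs.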
Exercise
Write an if-else statement:
- If PMT > 1000, add 10000 to PMT (i.e., PMT <- PMT + 10000)
- Else if PMT > 500, add 100 to PMT
- Else, set PMT = 0
What is the outcome of above if-else, if initial PMT was 750?
For loops
For loops are used when code has to be iterated a specified number of times.
When a for loop is written explicitly:
[1] 1
[1] 2
[1] 3
[1] 4
[1] 5
for assigns the item in the current environment, overwriting existing variable with the same name.
Items are accessed one by one in vector in for loop:
[1] "AAPL"
[1] "BAC"
[1] "C"
[1] "DAL"
To use index for each element: use seq_along() on the vector.
[1] 1 2 3 4
When looping over date / times, loops strip the attributes:
[1] 18262
[1] 18383
As a workaround, use indexing with seq_along() and [[.
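A sketch of the problem and the index-based alternative. The two dates are assumed from the numeric output above (18262 and 18383 days since the Unix Epoch).

```r
dates <- as.Date(c("2020-01-01", "2020-05-01"))

# Direct iteration: each item arrives as a bare number
for (d in dates) {
  loop_class <- class(d)     # "numeric": the Date class is gone
  print(d)
}

# Index-based iteration: [[ keeps the Date class
for (i in seq_along(dates)) {
  print(dates[[i]])
}
```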
For loop: preallocation
Memory Preallocation is creating the full size of the output object before the loop.
For example:
[1] 0 0 0 0 0 0 0 0 0 0
[1] 1 4 9 16 25 36 49 64 81 100
Important tips when looping:
- Use bracket indexing [] instead of c()
- Preallocating the size of the container is strongly recommended.
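Preallocation combined with bracket indexing can be sketched as follows, reproducing the squares of 1 to 10 (the object names are illustrative):

```r
n <- 10

# Create the full-size container before the loop (ten zeros)
squares <- vector("numeric", length = n)

# Fill each slot in place with bracket indexing
for (i in seq_len(n)) {
  squares[[i]] <- i^2
}

squares   # 1 4 9 16 25 36 49 64 81 100
```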
Best Example
[[1]]
[1] 1
[[2]]
[1] 4
[[3]]
[1] 9
[[4]]
[1] 16
[[5]]
[1] 25
[[6]]
[1] 36
[[7]]
[1] 49
[[8]]
[1] 64
[[9]]
[1] 81
[[10]]
[1] 100
Bad Example
[1] 1 4 9 16 25 36 49 64 81 100
Good Example
If preallocation is cumbersome, use list() for output container then convert to a vector if needed.
Research: Benchmarking
N <- 5000
list_prealloc <- vector("list", length = N)
list_noalloc <- list()
vector_prealloc <- vector("numeric", length = N)
vector_noalloc <- numeric()
bench::mark(
list_prealloc_bracket = for (n in 1:N) {
list_prealloc[[n]] <- n**2
},
list_noalloc_bracket = for (n in 1:N) {
list_noalloc[[n]] <- n**2
},
vector_prealloc_bracket = for (n in 1:N) {
vector_prealloc[[n]] <- n**2
},
vector_noalloc_bracket = for (n in 1:N) {
vector_noalloc[[n]] <- n**2
},
list_noalloc_c = for (n in 1:N) {
list_noalloc <- c(list_noalloc, n**2)
},
vector_noalloc_c = for (n in 1:N) {
vector_noalloc <- c(vector_noalloc, n**2)
},
list_prealloc_c = for (n in 1:N) {
list_prealloc <- c(list_prealloc, n**2)
},
vector_prealloc_c = for (n in 1:N) {
vector_prealloc <- c(vector_prealloc, n**2)
},
iterations = 5,
check = FALSE
)
# A tibble: 8 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 list_prealloc_bracket 872µs 907µs 1074. 54.8KB 0
2 list_noalloc_bracket 905µs 962µs 669. 830.1KB 134.
3 vector_prealloc_bracket 840µs 882µs 1026. 54.8KB 0
4 vector_noalloc_bracket 886µs 924µs 1086. 830.1KB 0
5 list_noalloc_c 419ms 640ms 1.53 286.4MB 31.8
6 vector_noalloc_c 128ms 264ms 4.24 286.4MB 89.9
7 list_prealloc_c 357ms 638ms 1.53 286.4MB 32.5
8 vector_prealloc_c 123ms 220ms 4.28 286.4MB 92.5
Verdict
When performing loops:
- Preallocation + bracket indexing [] is the best.
- No preallocation is forgivable.
- Repeated use of c() is strongly discouraged.
next and break
Generally used with if-else condition tests inside loop.
next is used to skip an iteration of loop.
break is used to exit loop immediately.
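A minimal sketch using both keywords in one loop. The conditions are made up for illustration.

```r
odd_sum <- 0

for (i in 1:10) {
  if (i %% 2 == 0) {
    next              # skip even numbers, continue with the next i
  }
  if (i > 7) {
    break             # leave the loop entirely at i = 9
  }
  odd_sum <- odd_sum + i   # reached for i = 1, 3, 5, 7
}

odd_sum   # 16
```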
For loop: Compound interest
How to calculate compound interest over multiple years using a for loop?
- Principal: $10,000
- Interest rate: 5%
- Number of years: 10
[1] 10500.00 11025.00 11576.25 12155.06 12762.82 13400.96 14071.00 14774.55
[9] 15513.28 16288.95
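One way to produce these values, sketched with a preallocated container (the object names are illustrative):

```r
principal <- 10000
rate <- 0.05
years <- 10

balance <- vector("numeric", length = years)   # preallocate
value <- principal

for (year in seq_len(years)) {
  value <- value * (1 + rate)    # add one year of interest
  balance[[year]] <- value
}

round(balance, 2)

# The same result without a loop, using vectorized exponentiation
balance_vectorized <- principal * (1 + rate)^(1:years)
```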
Exercise
Based on the previous example, do the following:
Q1. Skip the first year using if and next
- Cashflow should have zero (NULL) on the first slot
Q2. Stop the calculation if value exceeds $14,000
- Cashflow should have zero (NULL) on slots that exceed value of $14,000
While loops
While loops begin with testing condition, and iterates the code as long as the condition is TRUE.
- If not written properly, it can be infinite loop.
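A sketch of a while loop, using an assumed question: how many years until $10,000 doubles at 5% interest?

```r
balance <- 10000
years <- 0

# The condition is tested BEFORE each iteration
while (balance < 20000) {
  balance <- balance * 1.05   # this update is what eventually ends the loop
  years <- years + 1
}

years   # 15
```

If the update line were missing, the condition would stay TRUE forever and the loop would never end.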
Exercise
Write code that prints 1 to 10 using a for loop.
Achieve the same result using a while loop instead.
- Start by defining a <- 1 outside of the loop
Based on 2, tweak the code so that it skips printing the number 5.
- Be careful not to get into an infinite loop!
Exercise 2
Write a function that checks class of an input.
If the input is numeric, print “Numeric input!”, otherwise, print “Not numeric!”
- use inherits(x, "numeric") for the logical test.
Function mapping
The map() function from purrr is an implicit loop over a function.
- a function f is an input argument for map()
- More succinct and easier to read than for loops
- map() requires the tidyverse or purrr package
Functions that take other functions as inputs are called functionals in R, like map().
Remember, though: if a vectorized operation is possible, avoid using for loops or map().
Example: map()
- Output is always list
[[1]]
[1] 2
[[2]]
[1] 3
[[3]]
[1] 4
With for loop, code tends to be longer and requires preallocation.
Exercise
Generate a times_two() function that multiplies its input by 2.
Map the times_two function over 1:10.
Achieve the same result with a for loop.
map function 2
If the desired output is not list but atomic vector:
- map_dbl(): a numeric (double) vector
- map_chr(): a character vector
- map_lgl(): a logical vector
- map_int(): an integer vector
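A sketch contrasting map() with a typed variant. It assumes the purrr package is installed; add_one() is a hypothetical helper.

```r
library(purrr)

add_one <- function(x) x + 1

as_list   <- map(1:3, add_one)       # always a list: 2, 3, 4
as_double <- map_dbl(1:3, add_one)   # an atomic double vector: 2 3 4
```

The typed variants raise an error if the function does not return a single value of the expected type for each element.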
Vectorized Operation and Loops
Most function operations in R are vectorized by default.
- Intuitive and faster: easier to read, write
- R is built with those operations in mind
- Avoid using for loops or map if vectorization is possible
Example: portfolio value
[1] 1500 1250 2000
A for loop approach:
[1] 1500 1250 2000
A map approach:
- map the * function over two input vectors (price, share)
- Use map2() for this case; see ?map2 for more info
Benchmark comparison
bench::mark(
vectorizing = {portfolio_value <- stock_prices * shares_held},
map2 = {portfolio_value <- map2(stock_prices, shares_held, `*`)},
for_loop_prealloc = {
portfolio_value <- vector("numeric", N) # container
for (i in 1:N) {
portfolio_value[[i]] <- stock_prices[[i]] * shares_held[[i]]
}},
iterations = 100,
check = FALSE
)
# A tibble: 3 × 6
expression min median `itr/sec` mem_alloc `gc/sec`
<bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
1 vectorizing 82ns 82ns 8966997. 0B 0
2 map2 49µs 51.5µs 18873. 264B 0
3 for_loop_prealloc 851µs 951.3µs 1042. 20.1KB 32.2
Exercise
Generate portfolio value of each asset, using:
- Vectorized multiplication (vector output)
- for loop
- map (implicit loop)
Vectorized if-else
The ifelse() function is a vectorized if-else statement.
- Useful when you have a vector of TRUE / FALSE condition tests
- No need to loop over each element of vector
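A minimal sketch with made-up dividend yields and an assumed 2% cutoff:

```r
dividend_yield <- c(0.00, 0.03, 0.05, 0.01)

# ifelse(test, value_if_TRUE, value_if_FALSE), applied element by element
yield_label <- ifelse(dividend_yield > 0.02, "High", "Low")

yield_label   # "Low" "High" "High" "Low"
```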
Example: Dividend Payments
Exercise
Practice ifelse():
- if stock price is greater than 53, assign “Bull”
- otherwise “Bear”
- assign it to the sentiment object.
Vectorized if-else 2
case_when() from the dplyr package (part of tidyverse) is a general vectorized if-else.
[1] "Bear" "Bull" "Weak" "Bear" "Normal" "Bear"
Exercise 2
Practice case_when():
- if stock price is greater than 58, assign “Bull”
- if stock price is greater than 50, assign “Normal”
- otherwise “Bear”
- assign it to the sentiment2 object.
File Systems in R
Package fs
The fs package provides a simple, consistent, and cross-platform way to handle:
- Path operations
- File and directory control
- File information
What is a Path?
A path is a string of characters used to uniquely identify a file or folder in a file system.
Types of paths:
- Absolute path: exact location of a file or directory from the root.
- Relative path: location relative to the current working directory.
Working Directory
The working directory is the location where the program (R, bash, Python, etc.) is running.
- getwd() shows the current working directory.
- setwd("/path/to/directory") changes the working directory to the specified path.
Absolute Paths
- Begins from the root directory (/ in Mac/Linux, C:\ in Windows)
- Example (Mac/Linux): /Users/username/Documents/project/data.csv
- Example (Windows): C:\Users\username\Documents\project\data.csv
Absolute paths are unambiguous.
Relative Paths
- Path that is relative to the current working directory.
- Example: ./data/project/data.csv (The . denotes the current directory)
- Succinct and easier to manage paths in projects
Directory references
.: the current directory.
- From /Users/john/projects, ./data refers to /Users/john/projects/data.
..: the parent directory; one level up from the current directory.
- From /Users/john/projects, ../data refers to /Users/john/data.
~: the home directory.
- Default directory for user in OS
Home Directory
The “default” directory for user in operating system
Mac: /Users/<username> (on Linux: /home/<username>)
- Example: If your username is john, your home directory would be /Users/john on Mac.
Windows: C:\Users\<username>
- Example: If your username is john, your home directory would be C:\Users\john
Tilde ~
Represents the user’s home directory.
Example: ~/cases refers to the cases folder in the user’s home directory:
- C:\Users\<username>\cases for Windows
- /Users/john/cases for Mac (/home/john/cases for Linux)
Creating File/Directory
Creating and deleting files and directories is simple:
List files and directories
List files and directories:
It’s especially useful with globbing / regex:
Exercise
- Create an R script file named fs_exercise.R in your working directory.
- List all files that have the .R file extension.
- What is the absolute path of the script file?
- From your home (~), what is the relative path of the script file?
Text Data Files
A plain, human-readable text data file, delimited by a specific character
- Comma-Separated Values (CSV) with ,
- Tab-Separated Values (TSV) with \t
- Since it is text, R tries to “guess” the correct data type of each column when importing
Text Data Example
A text data file typically looks like:
- Usually the first line is a header (column names)
- Data values separated by a delimiter (e.g., , for CSV, \t for TSV)
Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
Text Data files
Many packages support writing/reading csv/tsv files:
- base R (utils package): basic, slow
- readr from tidyverse: extremely fast, functional
Write CSV / TSV
To write a data.frame to a csv file: write_csv()
To read a .csv / .tsv file to a data.frame: read_csv(), read_tsv()
Other data formats
There are other common data formats:
- “.xlsx”: Excel spreadsheets
- “.json”: JavaScript Object Notation (NoSQL)
- “.parquet”: columnar big data storage
- “.sas7bdat”, “.dta”: SAS and Stata data files
R Data frame (and tibble) class
Data frames
One of the most important data classes in R, built on top of the list type.
Stores data structure in 2D tabular form:
with rows (observations, or records)
and columns (variables)
Columns can be different types!
Create data frame
Creating a data.frame is almost identical to list.
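To illustrate the similarity, here is a sketch that builds the same (made-up) bond data once as a list and once as a data.frame. The column names and values are hypothetical.

```r
# A list: named elements, each a vector
bond_list <- list(
  issuer   = c("UST", "AAPL", "TSLA"),
  coupon   = c(4.25, 3.10, 5.30),
  callable = c(FALSE, TRUE, TRUE)
)

# A data.frame: same syntax, but every column must have the same length
bonds <- data.frame(
  issuer   = c("UST", "AAPL", "TSLA"),
  coupon   = c(4.25, 3.10, 5.30),
  callable = c(FALSE, TRUE, TRUE)
)

str(bonds)      # one row per bond, columns of different types
is.list(bonds)  # a data.frame is still a list underneath
```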
Exercise
Create a dataframe named housing:
- 6 columns: Name, Age, Sex, Income, Housing, Zipcode
- Name: Amy, Bill, Charles, Donna, Eckert
- Age: 21, 25, 30, 38, 49
- Sex: Female, Male, Male, Female, Male
- Income: 36000, 53000, 89000, 82000, 166000
- Housing: Rent, Rent, Own, Own, Rent
- Zipcode: 12333, 12543, 11255, 12333, 33533
What is the type (class) of each column, as automatically recognized by R?
- Check with str(housing).
Q: What should be their type (class) in theory?
Tibble class
Essentially the same as the dataframe class, with some fixes:
- Fixes old inconsistencies in R data.frame class
- Safer executions
- Better console displays
as_tibble() converts data.frame class to tibble class.
Example
A toy dataset, iris dataframe:
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Class of iris:
Convert iris to tibble class:
- Prints the dimension
- Prints data class by column
# A tibble: 3 × 5
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
iris_tb is a multi-class object that is both tibble and dataframe.
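A short sketch of the conversion and the class check described above:

```r
library(tibble)

class(iris)                 # "data.frame"

iris_tb <- as_tibble(iris)  # convert to tibble
class(iris_tb)              # "tbl_df" "tbl" "data.frame"

# Because "data.frame" is among its classes, data.frame tools still apply
is.data.frame(iris_tb)
```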
Access operations
As it is built on lists, [, [[, and $ also work on data.frames.
- Bracket operations behave more consistently with the tibble class
A dataframe can be subsetted with df[i, j]:
- The i part operates on rows (called filtering)
- The j part selects columns
Bracket [, [[ subsetting
1st row, 1st column, in element class (numeric vector)
Double bracket [[ works with , only when both row and columns are mentioned. That is, iris_tb[[1,1]] works, but iris_tb[[,1]] doesn’t.
To extract a column as a plain vector (the element's class), you'll later learn pull().
Other examples:
If comma is not provided, it assumes a column index.
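The bracket behaviors above can be sketched on iris_tb as follows. The specific rows and columns chosen here are only illustrative.

```r
library(tibble)
iris_tb <- as_tibble(iris)

cell_tbl <- iris_tb[1, 1]     # single bracket: still a tibble (1 x 1)
cell_val <- iris_tb[[1, 1]]   # double bracket: the element itself, 5.1

block    <- iris_tb[1:3, c("Sepal.Length", "Species")]  # rows 1-3, two columns

col_tbl  <- iris_tb[1]        # no comma: treated as a column index, returns a tibble
col_vec  <- iris_tb[[1]]      # no comma, double bracket: the column as a numeric vector
```

With a tibble, single brackets always return a tibble, while double brackets return the underlying element or vector.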
$ subsetting
$ extracts a single column from a data frame (tibble) as a vector, i.e., in the element's class.
Exercise
- Generate iris_tb by converting iris using as_tibble().
- Exercise all subsetting methods on the rows in the 2nd column:
  - Use single bracket and integer index: iris_tb[2]
  - Use double bracket and integer index: iris_tb[[2]]
  - Use single bracket and column name: iris_tb[colname]
  - Use double bracket and column name: iris_tb[[colname]]
  - Use $
- Filter rows with “Sepal.Length > 5”. How many rows do you observe?
  - Use nrow() to check the number of rows.
Chained subset call
Subsetting can be chained.
- Pause and take a look at the example. What is happening?
- Confirm with iris_tb[45, c("Petal.Length", "Sepal.Length")]
Exercise
On iris_tb,
- Filter with “Petal.Length > 4.5”
- Select columns “Species”, “Petal.Length”
- then slice rows from 1 to 10
- Store this as filtered_iris
Q. Confirm the average of Petal.Length from filtered_iris is 4.74.
- Use mean(dataframe$column).
Assign and remove column
They work the same way as with lists. To assign a new variable within the data.frame, use:
To remove a variable from the data.frame, use:
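A minimal base-R sketch of both operations on a small, made-up holdings table (column names are hypothetical):

```r
holdings <- data.frame(
  price  = c(100, 102, 99),
  shares = c(10, 20, 30)
)

# Assign new columns, exactly as you would add elements to a list
holdings$value       <- holdings$price * holdings$shares
holdings[["weight"]] <- holdings$value / sum(holdings$value)

# Remove a column by assigning NULL
holdings$shares <- NULL

names(holdings)
```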
Modern syntax: dplyr package
R package for dataframe manipulation tasks.
- A grammar of data manipulation
- Replaces the use of [, [[, $ in most cases
- Intuitive and easy to understand
- Fast, written in C++
Prep: Company Financials Data
Company_financials.csv data will be available in our GitHub Class repository.
- Option 1: Git pull and copy the data to your class working folder
- Option 2: Direct address typing as below
Data Overview
- Balance Sheet items: Assets, Liabilities, Equity, etc
- Year: reporting year
- Industry: Industry classification for each company
- Company: Ticker symbol
The select() verb
select() lets you choose specific columns.
- by column names
- by index
- by helper functions (starts_with, ends_with, etc)
Suppose you want to select all items whose names start with “current”,
or end with “liabilities”,
or contain “asset”.
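The selection styles above can be sketched on a toy tibble. fin_toy and its column names are hypothetical stand-ins, not the actual Company_financials.csv layout.

```r
library(dplyr)

# Hypothetical toy data; column names are illustrative only
fin_toy <- tibble(
  ticker              = c("AAA", "BBB"),
  year                = c(2023, 2023),
  current_assets      = c(1.5e11, 5.0e10),
  total_assets        = c(3.0e11, 2.0e11),
  current_liabilities = c(9.0e10, 4.0e10),
  total_liabilities   = c(2.0e11, 1.2e11)
)

by_name    <- fin_toy |> select(ticker, year)               # by column names
by_index   <- fin_toy |> select(1, 2)                       # by position
cur_cols   <- fin_toy |> select(starts_with("current"))     # helper: prefix
liab_cols  <- fin_toy |> select(ends_with("liabilities"))   # helper: suffix
asset_cols <- fin_toy |> select(contains("asset"))          # helper: substring
```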
The relocate() verb
relocate() is used to change the order of columns.
The rename() verb
rename() changes the column names.
The pull() verb
pull() extracts a single column as a vector.
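A compact sketch of relocate(), rename(), and pull() on a hypothetical toy tibble (the names fin_toy and revenue are made up for illustration):

```r
library(dplyr)

fin_toy <- tibble(
  ticker  = c("AAA", "BBB", "CCC"),
  year    = c(2021, 2022, 2023),
  revenue = c(10, 20, 30)
)

moved   <- fin_toy |> relocate(year)                     # year becomes the first column
after   <- fin_toy |> relocate(ticker, .after = revenue) # ticker moves after revenue
renamed <- fin_toy |> rename(symbol = ticker)            # new_name = old_name
rev_vec <- fin_toy |> pull(revenue)                      # plain numeric vector, not a tibble
```

Note the argument order in rename(): the new name goes on the left.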
Exercise
Using the dataset fin_data:
1. Create a new tibble that includes only “ticker”, “Industry”, “year”, “market_cap”, and the columns that start with “current”.
2. Relocate the column “market_cap” to the first position.
3. Rename “market_cap” to “Market_Cap”.
4. Pull “ticker” as a vector from fin_data.
5. Combine 1 to 4 with a pipe chain to achieve all at once.
The filter() verb
filter() lets you choose rows based on conditions.
Apply filters across columns: if_any() and if_all()
Imagine your dataset includes multiple asset columns (e.g., current assets and current liabilities). You want to filter rows where any asset value exceeds $100B (1e11).
if_all() for strict filtering:
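A sketch of the difference between if_any() and if_all(), on hypothetical toy data with two asset columns:

```r
library(dplyr)

# Hypothetical toy data
fin_toy <- tibble(
  ticker         = c("AAA", "BBB", "CCC"),
  current_assets = c(1.5e11, 5.0e10, 2.0e10),
  total_assets   = c(3.0e11, 2.0e11, 8.0e10)
)

# Keep rows where ANY asset column exceeds $100B
big_any <- fin_toy |>
  filter(if_any(contains("assets"), \(x) x > 1e11))

# Keep rows where ALL asset columns exceed $100B (stricter)
big_all <- fin_toy |>
  filter(if_all(contains("assets"), \(x) x > 1e11))
```

Here BBB passes if_any() through total_assets alone, but fails if_all() because its current_assets is below the threshold.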
The slice() verb
slice() extracts rows by position.
The distinct() verb
distinct() removes duplicate rows based on referred columns.
Exercise
From fin_data:
Keep only rows where year is greater than 2022 and Industry is “Financials”.
Filter rows where any of the columns that contain “asset” exceeds $100B (1e11).
Filter rows where all of the columns that contain “current” exceed $10B (1e10).
Slice the first 3 rows of the data.
Show distinct values of “Industry” in the data, and keep the other columns.
The arrange() verb
arrange() reorders the rows by one or more columns.
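Best practice: do not chain arrange() calls. A later arrange() re-sorts the rows by its own columns, so the earlier ordering is no longer the primary sort. Pass all sort columns to a single arrange() call instead.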
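A small sketch contrasting a single arrange() call with a chained one, using made-up rows:

```r
library(dplyr)

fin_toy <- tibble(
  ticker = c("BBB", "AAA", "BBB", "AAA"),
  year   = c(2022, 2023, 2023, 2022)
)

# One call, several keys: ticker ascending, then year descending within ticker
sorted <- fin_toy |> arrange(ticker, desc(year))

# Chained calls: the second arrange() makes year the primary sort key
chained <- fin_toy |> arrange(ticker) |> arrange(year)
```

In the chained version the rows are no longer grouped by ticker, which is usually not what was intended.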
The mutate() verb
mutate() lets you create or modify columns.
Using general if-else with case_when() to classify:
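As an illustration, mutate() can create a ratio and case_when() can then classify it. The column names and the 0.6 / 0.3 cutoffs below are hypothetical.

```r
library(dplyr)

fin_toy <- tibble(
  ticker       = c("AAA", "BBB", "CCC"),
  total_debt   = c(70, 45, 10),
  total_assets = c(100, 100, 100)
)

classified <- fin_toy |>
  mutate(
    debt_ratio = total_debt / total_assets,
    # Conditions are checked top to bottom; the first match wins
    leverage = case_when(
      debt_ratio > 0.6 ~ "High",
      debt_ratio > 0.3 ~ "Medium",
      TRUE             ~ "Low"      # fallback for everything else
    )
  )
```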
Exercise
From fin_data:
Arrange the data by ticker (ascending) and year.
Create new variable “debt_to_asset_ratio” as the ratio of current_debt to current_assets.
The summarize() verb
summarize() computes statistics for the entire dataset.
You can summarize by groups:
Or simply use .by in the summarize()
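Both grouping styles can be sketched on toy data as below. The .by argument assumes dplyr 1.1.0 or later; the data and column names are hypothetical.

```r
library(dplyr)

fin_toy <- tibble(
  ticker       = c("AAA", "BBB", "CCC"),
  industry     = c("Financials", "Financials", "Tech"),
  total_assets = c(100, 300, 50)
)

# Whole-dataset summary: one row
overall <- fin_toy |> summarize(avg_assets = mean(total_assets))

# Grouped summary with group_by()
by_group <- fin_toy |>
  group_by(industry) |>
  summarize(avg_assets = mean(total_assets), n = n())

# Same result with per-operation grouping (dplyr >= 1.1.0)
by_arg <- fin_toy |>
  summarize(avg_assets = mean(total_assets), n = n(), .by = industry)
```

One difference worth knowing: group_by() sorts the output by group, while .by keeps groups in order of first appearance.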
Lab Problem: Year-over-Year Growth Calculation
From fin_data:
Arrange the dataset by ticker and year in ascending order. Then, group the data by ticker.
Use mutate() along with the lag() function to calculate the year-over-year growth rate for current_assets. Name the variable yearly_asset_growth.
\(\frac{\text{current\_assets} - \text{lag(current\_assets)}}{\text{lag(current\_assets)}}\)
- Summarize the average yearly growth rate for each ticker.
dplyr grammar summary
Key verbs
- select(): select subset of columns
  - rename(): rename columns
  - relocate(): change column positions
  - pull(): extract single column as vector
- filter(): select subset of rows with condition
  - slice(): extract specific rows
  - distinct(): remove duplicate rows
- arrange(): reorder rows
- mutate(): add new columns (variables)
- summarize(): generate summary table
  - group_by()/ungroup()
Portfolio Sorting with Crypto
A mini finance project
Incorporating AI for Coding
From now on, I’ll introduce how to leverage AI for coding.
Generating code snippets
Troubleshoot and debug
Best Practices
Sample GenAIs for code
- ChatGPT
- Grok
- Claude.ai (limited)
- Meta.ai
- Gemini
Crypto Analysis with dplyr and tidyverse
Learn to:
- Financial data manipulation
- Calculate average returns and volatility
- Sort cryptocurrencies into portfolios
- Compare performance of different portfolios
- Visualize performance
Let’s optimize our crypto investments!
Prep Required Libraries
Our list of cryptos: 9 samples
Get data
How many observations are found for each crypto?
Calculate Returns
Calculate daily returns with arrange(), group_by() and mutate()
# A tibble: 6 × 9
symbol date open high low close volume adjusted daily_ret
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ADA-USD 2020-01-01 0.0328 0.0338 0.0327 0.0335 22948374 0.0335 NA
2 ADA-USD 2020-01-02 0.0335 0.0335 0.0324 0.0328 20843934 0.0328 -0.0211
3 ADA-USD 2020-01-03 0.0327 0.0344 0.0325 0.0342 30162644 0.0342 0.0436
4 ADA-USD 2020-01-04 0.0342 0.0347 0.0339 0.0346 29535781 0.0346 0.0121
5 ADA-USD 2020-01-05 0.0346 0.0354 0.0345 0.0347 21479178 0.0347 0.00364
6 ADA-USD 2020-01-06 0.0348 0.0373 0.0347 0.0373 37988444 0.0373 0.0735
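The same arrange() / group_by() / mutate() pattern can be sketched on a toy price table. The symbols and prices are made up, and using the adjusted price for the return is an assumption that is consistent with the output above.

```r
library(dplyr)

# Hypothetical prices, deliberately out of date order
prices_toy <- tibble(
  symbol   = c("AAA", "AAA", "AAA", "BBB", "BBB", "BBB"),
  date     = as.Date(c("2020-01-03", "2020-01-01", "2020-01-02",
                       "2020-01-01", "2020-01-02", "2020-01-03")),
  adjusted = c(121, 100, 110, 50, 45, 54)
)

returns_toy <- prices_toy |>
  arrange(symbol, date) |>      # sort first so lag() sees the previous day
  group_by(symbol) |>           # so lag() never crosses from one symbol to another
  mutate(daily_ret = adjusted / lag(adjusted) - 1) |>
  ungroup()
```

The first observation of each symbol has no previous price, so its return is NA, as in the first row of the output above.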
Performance metrics
Calculate performance metrics with group_by() and summarize()
# A tibble: 9 × 3
symbol avg_daily_ret vol_daily
<chr> <dbl> <dbl>
1 ADA-USD 0.00356 0.0588
2 BNB-USD 0.00421 0.0570
3 BTC-USD 0.00150 0.0379
4 DOGE-USD 0.00807 0.135
5 DOT-USD 0.00270 0.0678
6 ETH-USD 0.00333 0.0505
7 MATIC-USD 0.00668 0.0805
8 SOL-USD 0.00548 0.0793
9 XRP-USD 0.00247 0.0634
Visualize metrics
Visualize performance metrics with ggplot(), to generate barplot:
performance_metrics |>
ggplot(
aes(x = fct_reorder(symbol, -avg_daily_ret), y = avg_daily_ret, fill = symbol )
) +
geom_col() +
scale_y_continuous(labels = scales::percent_format())+
labs(
title = "Average Daily Return of Cryptos",
subtitle = "Year 2020 - 2022",
caption = "Data: Yahoo Finance",
x = "Crypto",
y = "Average Return",
fill = "Symbol"
) +
theme_minimal()
Similarly for volatility:
performance_metrics |>
ggplot(
aes(x = fct_reorder(symbol, vol_daily), y = vol_daily, fill = symbol )
) +
geom_col() +
scale_y_continuous(labels = scales::percent_format())+
labs(
title = "Daily Volatility of Cryptos",
subtitle = "Year 2020 - 2022",
caption = "Data: Yahoo Finance",
x = "Crypto",
y = "Volatility",
fill = "Symbol"
) +
theme_minimal()
To combine and juxtapose (simple):
- Can’t use a dual Y axis in this case
- Reorder the factor before pivoting if needed
performance_metrics |>
mutate(symbol = fct_reorder(symbol, desc(avg_daily_ret))) |>
pivot_longer(cols = !symbol) |> # make long form
ggplot(
aes(x = symbol, y = value, fill = name)
) +
geom_col(position = "dodge") +
scale_y_continuous(labels = scales::percent_format()) +
labs(
title = "Average Daily Return / Volatility of Cryptos",
subtitle = "Year 2020 - 2022",
caption = "Data: Yahoo Finance",
x = "Crypto",
y = "Average Return / Volatility (%)",
fill = "Metric"
)
To combine and juxtapose (advanced):
- Dual Y Axis technique
# Since return is smaller: scale by their max values
scale_factor <- max(performance_metrics$avg_daily_ret) / max(performance_metrics$vol_daily)
performance_metrics |>
ggplot(aes(x = fct_reorder(symbol, -avg_daily_ret))) +
geom_col(
aes(y = avg_daily_ret, fill = "Average Return"),
position = position_nudge(x=-0.2), # move to left
width = 0.4
) +
geom_col(
aes(y = vol_daily * scale_factor, fill = "Volatility"), # notice the scale factor
position = position_nudge(x=0.2), # move to right
width = 0.4) +
scale_y_continuous(
name = "Average Return (%)",
labels = scales::percent_format(),
sec.axis = sec_axis(
\(x) x / scale_factor,
name = "Volatility (%)",
labels = scales::percent_format())
) +
labs(
title = "Average Return and Volatility, Dual Axis",
subtitle = "Year 2020 - 2022",
caption = "Data: Yahoo Finance",
x = "Crypto",
fill = "Metric"
) +
theme_bw()
Portfolio Sorting: Trading volume
A simple univariate portfolio sorting to see if “trading volume” predicts future returns.
- Calculate average trading volume and sort into quintile (5) groups
# A tibble: 6 × 3
symbol avg_daily_volume volume_rank
<chr> <dbl> <int>
1 ADA-USD 1874739609. 3
2 BNB-USD 1554934937. 2
3 BTC-USD 36702325339. 5
4 DOGE-USD 1712255288. 3
5 DOT-USD 1392992622. 2
6 ETH-USD 18924519134. 4
Let’s test if volume explains future crypto returns:
- Volume observed data from 2020-01-01 to 2023-01-01
- 1 month future daily returns
Join rank (from past observation) to future crypto data using “symbol” as key
# A tibble: 6 × 11
symbol date open high low close volume adjusted daily_ret
<chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 ADA-USD 2023-01-02 0.250 0.256 0.247 0.254 159328803 0.254 NA
2 ADA-USD 2023-01-03 0.254 0.255 0.251 0.253 153555529 0.253 -0.00407
3 ADA-USD 2023-01-04 0.253 0.270 0.252 0.268 289945179 0.268 0.0589
4 ADA-USD 2023-01-05 0.268 0.270 0.264 0.269 175511469 0.269 0.00532
5 ADA-USD 2023-01-06 0.269 0.279 0.268 0.279 326480796 0.279 0.0355
6 ADA-USD 2023-01-07 0.279 0.280 0.273 0.277 166488086 0.277 -0.00555
# ℹ 2 more variables: avg_daily_volume <dbl>, volume_rank <int>
Generate average crypto daily return by volume rank:
# A tibble: 5 × 2
volume_rank avg_daily_ret_by_volume_sorting
<int> <dbl>
1 1 0.0228
2 2 0.0109
3 3 0.0126
4 4 0.00829
5 5 0.0121
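The sort, join, and rank-level average can be sketched end to end on toy data. Everything here is hypothetical: the symbols, the numbers, and the use of ntile() for the ranking (the output above is consistent with ntile(), but the original code is not shown). Two groups are used instead of five to keep the example small.

```r
library(dplyr)

# Past-period average volume, ranked into 2 groups (the slides use 5)
ranks_toy <- tibble(
  symbol           = c("AAA", "BBB", "CCC", "DDD"),
  avg_daily_volume = c(4e9, 1e9, 9e9, 2e9)
) |>
  mutate(volume_rank = ntile(avg_daily_volume, 2))

# Future-period daily returns
future_toy <- tibble(
  symbol    = rep(c("AAA", "BBB", "CCC", "DDD"), each = 2),
  daily_ret = c(0.01, 0.03, 0.02, 0.04, -0.01, 0.01, 0.00, 0.02)
)

portfolio_toy <- future_toy |>
  left_join(ranks_toy, by = "symbol") |>     # attach past rank to future returns
  group_by(volume_rank) |>
  summarize(avg_ret = mean(daily_ret, na.rm = TRUE))
```

left_join() keeps every future return row and copies the rank of the matching symbol onto it.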
Visualize (bar plot): basic plot
Problems:
- volume_rank is considered as numeric
- labels, themes, etc
Finalizing plot:
portfolio_analysis |>
mutate(volume_rank = as.factor(volume_rank)) |>
ggplot(aes(x = volume_rank, y = avg_daily_ret_by_volume_sorting, fill = volume_rank)) +
geom_col() +
labs(
title = "Average Daily Crypto return by Volume Sorting",
subtitle = "January 2023 ",
caption = "Quintile Volume Sorting from 2020-2022 Data",
x = "Volume Rank (1 Low 5 High Volume)",
y = "Average Daily Return (%)",
fill = "Volume Rank"
) +
scale_y_continuous(labels = scales::percent_format()) +
scale_fill_brewer(palette = "Set1") +
theme_bw()
Notes
This analysis is a demo. For more rigor, consider:
- Expand the sample scope (8~9 cryptos may not be good enough to generalize)
- Test on different time frames
Factor Vectors
Factors
Factors represent categorical variables that contain a fixed and known set of possible values.
They’re useful when you want to display character vectors in a specific, non-alphabetical order.
I introduce forcats::fct() instead of base R’s factor(), because fct() behaves more predictably.
Why Use Factors?
Factors solve two common problems with character vectors:
Typos and invalid entries: Factors restrict inputs to predefined categories.
Sorting: Factors can sort according to a custom order, rather than alphabetically.
Creating Factor Variables
Use the forcats::fct() function from the forcats package (part of the tidyverse):
[1] BBB AA A CCC
Levels: AAA AA A BBB BB B CCC
Sorting factors respects the defined level’s sequence, as if it was an order:
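A sketch that reproduces the output above and the sorting behavior. It assumes forcats 1.0.0 or later (where fct() is available); the input values are taken from the printed output.

```r
library(forcats)

rating_levels <- c("AAA", "AA", "A", "BBB", "BB", "B", "CCC")  # best to worst

ratings <- fct(c("BBB", "AA", "A", "CCC"), levels = rating_levels)
ratings

# sort() follows the level order, not the alphabet
sorted_ratings <- sort(ratings)
sorted_ratings
```

Sorting the same values as plain characters would instead give "A" "AA" "BBB" "CCC".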
If values not in the levels appear, forcats::fct() raises an error:
Error in `fct()`:
! All values of `x` must appear in `levels` or `na`
ℹ Missing level: "D"
Error: object 'ratings_invalid' not found
If levels are not mentioned when defined, it honors initial input ordering:
[1] "USA" "Canada" "South Korea"
Base R’s factor() doesn’t behave like this: it orders the levels alphabetically, which is discouraged.
Accessing Factor Levels
Levels are a very important attribute of factor objects.
To browse or modify the levels attribute, use levels(). If the levels change, the corresponding values are recoded.
[1] "AAA" "AA" "A" "BBB" "BB" "B" "CCC"
[1] 4 2 3 7
Levels: 1 2 3 4 5 6 7
AAA becomes 1, AA becomes 2, and so on, preserving the order.
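A sketch consistent with the output above: browsing the levels, then replacing them with the labels 1 to 7. The input values are inferred from the printed result.

```r
library(forcats)

rating_levels <- c("AAA", "AA", "A", "BBB", "BB", "B", "CCC")
ratings <- fct(c("BBB", "AA", "A", "CCC"), levels = rating_levels)

levels(ratings)                       # browse the current levels

levels(ratings) <- as.character(1:7)  # replace level labels position by position
ratings                               # values are recoded along with the levels
```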
Recoding Factor Levels
A convenient and safer way to recode is to use fct_recode()
rating_levels <- c(
"AAA", "AA", "A",
"BBB", "BB", "B", "CCC") # possible values
bond_ratings <- fct(
c("AAA","BBB", "A", "BB", "AA"),
levels = rating_levels)
bond_ratings <- fct_recode(bond_ratings,
"Top Tier" = "AAA",
"High Grade" = "AA",
"Medium Grade" = "A",
"Lower Grade" = "BBB",
"Speculative" = "BB"
)
bond_ratings # B and CCC levels remain
[1] Top Tier Lower Grade Medium Grade Speculative High Grade
Levels: Top Tier High Grade Medium Grade Lower Grade Speculative B CCC
Collapsing Factor Levels
Use fct_collapse() to combine multiple levels:
[1] Investment Investment Investment Speculative Speculative Speculative
Levels: Investment Speculative
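One way to obtain a result of this shape is sketched below. The input values and the grouping (BBB and above as investment grade) are assumptions for illustration, since the original code is not shown.

```r
library(forcats)

rating_levels <- c("AAA", "AA", "A", "BBB", "BB", "B", "CCC")
bond_ratings  <- fct(c("AAA", "AA", "BBB", "BB", "B", "CCC"), levels = rating_levels)

# new_level = c(old levels to merge)
grade <- fct_collapse(
  bond_ratings,
  Investment  = c("AAA", "AA", "A", "BBB"),
  Speculative = c("BB", "B", "CCC")
)
grade
```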
Reorder Factor level by hand
You can reorder level by using fct_relevel(). Any levels not mentioned will be left in their existing order, after the explicitly mentioned levels.
stock_returns <- tibble(
Ticker = fct(c("AAPL", "MSFT", "GOOG", "JPM", "BAC")),
Sector = fct(c("Technology", "Technology", "Technology", "Financial", "Financial")),
Return = c(0.12, 0.08, 0.10, 0.05, 0.04)
)
stock_returns |>
mutate(Reordered_Ticker = fct_relevel(Ticker, "GOOG")) |>
pull(Reordered_Ticker)
[1] AAPL MSFT GOOG JPM BAC
Levels: GOOG AAPL MSFT JPM BAC
Reorder Factor level by variable
Use fct_reorder(f, x) to reorder factor levels according to x. It doesn’t change the position of the actual values, only the levels!
[1] AAPL MSFT GOOG JPM BAC
Levels: BAC JPM MSFT GOOG AAPL
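The output above can be reproduced with the tickers and returns from the earlier stock_returns example, here as plain vectors to keep the sketch short:

```r
library(forcats)

ticker     <- fct(c("AAPL", "MSFT", "GOOG", "JPM", "BAC"))
return_pct <- c(0.12, 0.08, 0.10, 0.05, 0.04)

# Levels are reordered by ascending return; the values keep their positions
reordered <- fct_reorder(ticker, return_pct)
reordered
levels(reordered)
```

This is why fct_reorder() is handy inside ggplot(aes()): the bars follow the level order.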
Lump infrequent levels
fct_other() lumps together the levels you choose (via keep or drop) into an “Other” category.
[1] AAA AA A BBB Other Other Other AA Other Other BBB Other
[13] Other A BBB BBB
Levels: A AA AAA BBB Other
Lump infrequent levels by frequency
fct_lump() lumps together infrequent levels to “other” category, by n or prop
[1] Other AA A BBB Other B CCC AA CCC B BBB Other
[13] Other A BBB BBB
Levels: A AA B BBB CCC Other
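The two lumping helpers can be compared on a small, made-up sector factor. fct_other() lumps by name, fct_lump() by frequency; recent forcats documentation also offers the more specific fct_lump_n() and fct_lump_prop().

```r
library(forcats)

# Hypothetical data: Tech x5, Financials x3, Energy x1, Utilities x1
sectors <- fct(c(rep("Tech", 5), rep("Financials", 3), "Energy", "Utilities"))

# Manual: keep only the levels you name, everything else becomes "Other"
manual <- fct_other(sectors, keep = "Tech")

# By frequency: keep the 2 most common levels, lump the rest
by_freq <- fct_lump(sectors, n = 2)

table(manual)
table(by_freq)
```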
Anonymize Levels
The fct_anon() function replaces the existing levels with anonymous (generic) labels.
Exercises
Create a factor market_regime from the vector c("Bear", "Sideways", "Bull") such that the order is Bull, Sideways, then Bear.
Then, recode the levels to “Downturn”, “Flat”, and “Upturn”
Use fct_recode() to change the names.
Given the following tibble of regional sales data:
Reorder the region factor based on avg_sales in ascending order and create a bar plot showing average sales by region.
Use fct_reorder(region, avg_sales) inside ggplot(aes())
You have a factor investment_style with the following values:
Collapse the factor into two groups using fct_collapse()
Traditional: includes “Growth”, “Value”, “Blend”
Alternative: includes “Contrarian”, “Speculative”
Given a vector of currency codes:
Use fct_other() to lump together any currency other than “USD” as “Other” category.
Use fct_lump() to lump infrequent currency as “Other”, using n or prop
Logical Vectors
Logical vectors
They contain three possible values: TRUE, FALSE, and NA. Used extensively in data filtering, comparisons, and conditional transformations.
NA in other types
NA itself is logical, but other atomic vectors (integer, double, character) can also contain missing values, so there is a corresponding NA for each type:
NA_integer_ for integer
NA_real_ for double
NA_character_ for character
R handles the type conversion automatically when needed, so users don’t need to use it manually.
Review Logic
Q1. What is TRUE & FALSE?
Q2. What is TRUE | FALSE?
Q3. What is TRUE & TRUE?
Q4. What is FALSE | FALSE?
Q5. What is TRUE | TRUE?
Q6. What is FALSE & FALSE?
Missing Values: NA
NA represents missing data
Comparisons with NA return NA
Use is.na() to check for missing values.
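A short sketch of these three points on a made-up revenue vector:

```r
revenue <- c(100, NA, 110)

cmp     <- revenue > 105    # comparison with NA gives NA in that position
na_eq   <- NA == NA         # even NA == NA is NA: two unknowns may or may not match
missing <- is.na(revenue)   # the right way to test for missing values

observed <- revenue[!is.na(revenue)]   # keep only the non-missing values
```

This is why filtering with x == NA never works; use is.na(x) instead.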
NA and Logical Operators
Keep in mind the logic:
TRUE | anything is TRUE
FALSE & anything is FALSE
Guess the results:
Logical Values from Comparisons
Comparison operators: <, <=, >, >=, !=, ==
Modulo
The modulo operator (%%) is very useful for testing the divisibility of numbers
[1] 2 4 6 8 10
[1] 1 3 5 7 9
Boolean Algebra
New: Exclusive OR: xor
Combining Logical Vectors
[1] FALSE FALSE FALSE TRUE FALSE
[1] TRUE FALSE FALSE TRUE TRUE
[1] TRUE TRUE TRUE FALSE FALSE
[1] TRUE TRUE TRUE FALSE TRUE
&: Element-wise AND
|: Element-wise OR
!: Negation
%in% Operator
Checks whether each element is found in another set.
Short-circuit operators
&& and || are short-circuit operators
- Operate only on length-one (scalar) inputs and return a single value
- Useful in programming (e.g., control flow)
- Don’t use them in dplyr functions!
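A sketch of why short-circuiting matters in control flow. The threshold variable is hypothetical; the point is that the right-hand side is skipped once the left-hand side decides the result.

```r
threshold <- NULL   # e.g., an optional setting that was never supplied

# && stops at the first FALSE, so `threshold > 0` is never evaluated
ok_scalar <- !is.null(threshold) && threshold > 0

# & evaluates both sides element-wise; NULL > 0 is logical(0), so the
# result has length zero and would break an if() statement
ok_vector <- !is.null(threshold) & threshold > 0

# || stops at the first TRUE, so stop() is never reached
skipped <- TRUE || stop("this is never evaluated")
```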
Floating point comparison
Checking equality with == on numeric (real-number) values is discouraged:
When checking with ==:
Why?
- There’s no way to exactly represent 1/49 or sqrt(2) with a fixed number of decimal places
- Computers store “close enough” numbers for real numbers
That’s why == was failing.
To compare real numbers, use dplyr::near() function.
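A sketch of the problem and the fix, using the two numbers mentioned above:

```r
x <- c(1 / 49 * 49, sqrt(2)^2)   # mathematically 1 and 2

x                      # prints as 1 2 ...
exact <- x == c(1, 2)  # ... but exact comparison fails for both

print(x, digits = 16)  # the stored values are slightly off

close <- dplyr::near(x, c(1, 2))  # comparison within a small tolerance
```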
Logical Summaries
There are two logical summaries: any() and all()
any(x) is the equivalent of |
- TRUE if any of x is TRUE, even when x includes NA
all(x) is the equivalent of &
- TRUE if all of x are TRUE
- FALSE if any of x is FALSE, even when x includes NA
Check below:
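A minimal set of checks covering the rules above, including what happens when NA leaves the answer undecided:

```r
any_true    <- any(c(TRUE, NA))    # TRUE: one TRUE is enough
any_unknown <- any(c(FALSE, NA))   # NA: the missing value could be TRUE

all_false   <- all(c(FALSE, NA))   # FALSE: one FALSE is enough
all_unknown <- all(c(TRUE, NA))    # NA: the missing value could be FALSE

# na.rm = TRUE drops the missing values before summarizing
all_known   <- all(c(TRUE, NA), na.rm = TRUE)
```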
Exercise
- Write a line of code to create a logical vector from the numbers 1 to 10 that tests whether each number is greater than 5.
- Hint: Use the comparison operator > on the vector 1:10.
- Using the modulo operator, write code to extract the even numbers from the vector 1:20.
- Hint: A number is even if it leaves a remainder of 0 when divided by 2 (i.e., use %%)
Explain the result of TRUE & NA. What does this tell you about logical operations involving NA?
Suppose you have a vector of company tickers:
and portfolio:
Write code to determine which tickers in tickers are present in portfolio. Yield a logical vector.
- Suppose you have a vector of daily returns for a stock,
- Use any() to check if there is at least one positive return.
- Use all() to check if all daily returns are positive.
- Explain the outputs considering the presence of NA values.
- Check the documentation: ?any and ?all
Numeric Vectors
Numeric Vectors
Numeric vectors are the backbone of financial data. Numerics include:
- Integer
- Double (real, or float)
We will use tidyverse verbs to manipulate numeric data in real-world finance examples.
Parse numbers
Sometimes numbers are stored as strings (characters), especially when data was imported from external sources.
parse_double() converts strings that are purely numeric
parse_number() extracts numeric parts from strings
parse_double() example:
parse_number() example:
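The two parsers can be sketched as follows; the input strings are made up.

```r
library(readr)

# parse_double(): the string must be purely numeric
clean <- parse_double(c("1.5", "2", "3e2"))

# parse_number(): ignores surrounding text, currency symbols, and grouping commas
messy <- parse_number(c("$1,234.50", "15%", "USD 99"))
```

parse_double() would report a parsing problem on a string like "$1,234.50", which is when parse_number() becomes useful.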
pmin, pmax
These functions compare values element-wise (rowwise in tibble).
Modular Arithmetic
Modular arithmetic is useful for breaking down composite numbers.
- %/%: integer division (quotient)
- %%: modulo operator (remainder)
For example, convert a time value in HHMM format to hours and minutes:
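A sketch of the HHMM split with made-up time values:

```r
time_hhmm <- c(930, 1345, 1600)   # 9:30, 13:45, 16:00

hours   <- time_hhmm %/% 100      # quotient:  9 13 16
minutes <- time_hhmm %% 100       # remainder: 30 45 0

# Total minutes since midnight, e.g. for comparing times
minutes_since_midnight <- hours * 60 + minutes
```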
Logarithms
In Finance, logarithmic returns are often used. In R, log() is natural log. log2() and log10() have base of 2 and 10.
Logarithmic (Log) Returns
Calculated as the natural logarithm of the ratio of consecutive prices:
\(r_{log} = \ln\left(\frac{P_t}{P_{t-1}}\right)\)
Log returns are additive over time, which makes cumulative calculations more straightforward.
Key Differences between arithmetic and logarithmic returns:
Additivity
Log returns can be summed to get the cumulative return; arithmetic returns must be compounded.
Approximation
For small returns, log returns are very similar to arithmetic returns, but the difference becomes significant for larger returns
Returns comparison
Let’s see how to compute cumulative returns using both methods.
Cumulative Returns
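A worked sketch with a made-up price path, showing that both methods recover the same total return:

```r
prices <- c(100, 105, 99, 108)

# Arithmetic returns: P_t / P_{t-1} - 1
arith_ret <- prices[-1] / prices[-length(prices)] - 1

# Log returns: ln(P_t / P_{t-1})
log_ret <- diff(log(prices))

# Cumulative return, arithmetic: compound (multiply) the gross returns
cum_arith <- prod(1 + arith_ret) - 1

# Cumulative return, log: simply add them up, then convert back
cum_log        <- sum(log_ret)
cum_from_log   <- exp(cum_log) - 1

c(cum_arith, cum_from_log)   # both equal 108 / 100 - 1
```

Simply summing the arithmetic returns would not give the total return; that shortcut only works for log returns.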
Rounding
Rounding is key for reporting. Use round(), floor(), and ceiling().
Cuts
cut() bins numeric values into discrete intervals with custom breaks.
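For example, daily returns could be binned into three labeled buckets. The breakpoints at plus and minus 1% are arbitrary choices for illustration.

```r
daily_ret <- c(-0.03, 0.002, 0.04, -0.005)

direction <- cut(
  daily_ret,
  breaks = c(-Inf, -0.01, 0.01, Inf),   # intervals: (-Inf,-0.01], (-0.01,0.01], (0.01,Inf]
  labels = c("Down", "Flat", "Up")
)
direction   # a factor with levels Down < Flat < Up in that order
```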
Offsets
dplyr::lead() and dplyr::lag() allow you to refer to the values just after or just before the current one.
Positions
Extract positions: first(), last(), nth()
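The offset and position helpers can be sketched together on a made-up price vector:

```r
library(dplyr)

prices <- c(100, 105, 99, 108)

prev_price <- lag(prices)    # NA 100 105  99  (previous value)
next_price <- lead(prices)   # 105  99 108  NA (next value)

first_price  <- first(prices)    # 100
last_price   <- last(prices)     # 108
second_price <- nth(prices, 2)   # 105
```

Calling library(dplyr) masks stats::lag(), which behaves differently, so dplyr::lag() is the one used here.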
Exercise
For the price:
Round price to
- The nearest whole number
- Two decimal places
- Nearest ten
Parse below character vector of prices properly:
Date/Time Vectors
Dates and Time in Finance
In finance, tracking dates and times is critical for modeling transactions, trade dates, settlement dates, and market events.
Although dates and times seem straightforward, they involve complexities such as:
- leap years
- time zones
- daylight saving time
Create date and time
There are three types:
- A date, tibble prints as <date>
- A datetime, tibble prints as <dttm>, also referred to as “POSIXct”
- A time, tibble prints as <time>, from hms
R doesn’t have a native class for time, but tidyverse (hms) offers it.
Simple Date and Time Vectors
today() and now() create date and datetime class vectors.
Parse Date when Import
If external data has standard (i.e, ISO8601) date and datetime, read_csv() will automatically parse it.
Parse Date with Manual Formatters
If external data has an ambiguous format, you can manually specify the format to handle.
Date/Time formatters
| Type | Code | Meaning | Example |
|---|---|---|---|
| Year | %Y | 4 digit year | 2021 |
| | %y | 2 digit year | 21 |
| Month | %m | Number | 2 |
| | %b | Abbreviated name | Feb |
| | %B | Full name | February |
| Day | %d | One or two digits | 2 |
| Time | %H | 24-hour hour | 13 |
| | %M | Minutes | 35 |
| | %S | Seconds | 45 |
| | %I | 12-hour hour | 1 |
| | %p | AM/PM | pm |
| | %Z | Time zone name | America/Chicago |
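As a sketch of how these codes are combined, readr's parse_date() and parse_datetime() accept a format string. The input strings below are made up.

```r
library(readr)

# month/day/2-digit year
d1 <- parse_date("01/02/15", format = "%m/%d/%y")

# day, full month name, 4-digit year
d2 <- parse_date("15 February 2023", format = "%d %B %Y")

# date plus 24-hour time (parsed as UTC by default)
dt <- parse_datetime("2023-02-15 13:35:45", format = "%Y-%m-%d %H:%M:%S")
```

The same format strings can be supplied when importing, for example through col_date(format = ...) in the col_types argument of read_csv().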
Exercise: Guess the Format!
Guess the correct Date/Time Format:
- “2021-07-25”
- “Jan, 1, 2011”
- “07/25/21”
- “2021-07-25 14:35:45”
- “07/25/2021 02:35 PM”
- “2021-07-25 14:35:45 EST”
- “25 July 2021”
Parse Strings to Date
Some cases are not handled perfectly by datetime format such as:
- “May 1st, 2023”
- “May 23rd, 2023”
lubridate package has nice handlers for those cases.
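A sketch of the lubridate helpers on the two strings above. The helper name spells out the order of the components: mdy() for month-day-year, and likewise ymd() and dmy().

```r
library(lubridate)

settle_dates <- mdy(c("May 1st, 2023", "May 23rd, 2023"))  # ordinal suffixes are handled
settle_dates

# Same idea for other component orders
ymd("2023-05-01")
dmy("1 May 2023")
```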
Parse Strings to Datetime
lubridate package has nice handlers for datetime as well. Timezone must be specified correctly.
trade_datetime_24 <- "2023-05-15 09:30:00"
trade_datetime_12 <- "May 15, 2023 09:30 AM"
trade_datetime_24_tz <- "2023-05-15 09:30:00 EST"
ymd_hms(trade_datetime_24) # UTC by default
ymd_hms(trade_datetime_24, tz = "EST") # set time zone at EST
mdy_hm(trade_datetime_12)
ymd_hms(trade_datetime_24_tz) # time zone should be mentioned
Time zones
Time zones are not just formatting. They change the underlying value, especially when a datetime is parsed from a string.
Robust Time Zones
If you’re American you’ll know “EST” as Eastern Standard Time, but both Australia and Canada also have EST!
R uses international standard, IANA time zones, {area}/{location}.
Changing Time Zones
There are two scenarios in which you want to change time zones:
- Keep the instant but change the time formatting
  - Like converting time using a world clock
- Keep the time formatting but change the instant
  - Usually to fix a data error
Keep the instant but change the formatting
with_tz() keeps the instant but changes the time zone.
- As you would see on a world clock!
Exercise
For the following datetime, change the time zone to Chicago, keeping the instant.
Keep the formatting but change the instant
force_tz() keeps the time formatting but changes the instant.
- To fix a data error
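The difference between the two functions can be sketched on one made-up trade time (New York and Chicago are one hour apart):

```r
library(lubridate)

trade_time <- ymd_hms("2023-05-15 09:30:00", tz = "America/New_York")

# Same moment, shown on a Chicago clock: 08:30
same_instant <- with_tz(trade_time, "America/Chicago")

# Same clock reading, relabeled as Chicago time: 09:30, but a different moment
same_clock <- force_tz(trade_time, "America/Chicago")

same_instant == trade_time                         # TRUE: nothing moved in time
difftime(same_clock, trade_time, units = "hours")  # 1 hour later in absolute time
```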
Exercise
For the following datetime, change the time zone to Chicago, keeping the clock time (formatting).
Date/Time Components
You can pull out individual parts of the date with the accessor functions.
Exercise
What is the
- weekday
- month day
- yearday
- month
- second
of example_datetime?
Rounding Dates
In finance, dates and times are often floored to match the data frequency and to keep the most relevant information as of a specific time.
floor_date(), ceiling_date() and round_date()
last_traded_time <- ymd_hms("2024-09-08 13:33:45.653 EST", tz = "EST") # milliseconds
floor_date(last_traded_time) # by default, second
ceiling_date(last_traded_time)
floor_date(last_traded_time, unit = "10 seconds")
floor_date(last_traded_time, unit = "15 mins")
floor_date(last_traded_time, unit = "2 hours")
Exercise
- Convert the string into a datetime object. Use time zone: America/New_York
- Round down the above trading time by “1 day”, “10 hours”, “5 minutes”, “10 seconds”
Missing Values
Missing Values
Missing values frequently appear in financial datasets.
Two types of missingness:
- Explicit missing: values marked NA
  - Presence of absence
- Implicit missing: absent rows that should be present
  - Absence of presence
Example
| company | year | quarter | revenue |
|---|---|---|---|
| AAPL | 2020 | 1 | 100 |
| AAPL | 2020 | 2 | NA |
| AAPL | 2020 | 3 | 110 |
| AAPL | 2021 | 1 | 200 |
| TSLA | 2021 | 1 | 210 |
| TSLA | 2021 | 2 | 220 |
Explicit missing:
- NA values in revenue
Implicit missing:
- AAPL: Q4 missing in year 2020, …
- TSLA: year 2020 missing, and Q3, Q4 missing in year 2021
Implicit Missing Values
Generally, you want to reveal those implicit missing cases as explicit. tidyr::complete() is handy for this operation.
tidyr is included in tidyverse.
# A tibble: 12 × 4
company year quarter revenue
<chr> <dbl> <dbl> <dbl>
1 AAPL 2020 1 100
2 AAPL 2020 2 NA
3 AAPL 2020 3 110
4 AAPL 2021 1 200
5 AAPL 2021 2 NA
6 AAPL 2021 3 NA
7 TSLA 2020 1 NA
8 TSLA 2020 2 NA
9 TSLA 2020 3 NA
10 TSLA 2021 1 210
11 TSLA 2021 2 220
12 TSLA 2021 3 NA
Since Q4 is missing for every company and year, complete() cannot make every missing value explicit.
In this case, you can provide your own data.
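Both cases can be sketched with the example table above: first letting complete() use only the observed values, then supplying the full set of quarters by hand. The object names are made up.

```r
library(tibble)
library(tidyr)

revenue_data <- tibble(
  company = c("AAPL", "AAPL", "AAPL", "AAPL", "TSLA", "TSLA"),
  year    = c(2020, 2020, 2020, 2021, 2021, 2021),
  quarter = c(1, 2, 3, 1, 1, 2),
  revenue = c(100, NA, 110, 200, 210, 220)
)

# Uses only observed values: quarters 1-3, so 2 x 2 x 3 = 12 rows
observed_only <- revenue_data |> complete(company, year, quarter)

# Supplying your own values adds Q4: 2 x 2 x 4 = 16 rows
all_quarters <- revenue_data |> complete(company, year, quarter = 1:4)
```

Newly created rows get NA in every column that is not part of the combination, here revenue.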
Explicit Missing Values
There are roughly 3 methods to handle missing values in Finance:
- Filling with designated value
  - e.g., Replace NA with 0
- Last observation carried forward
  - e.g., Use most recent past observation to fill NA
  - c.f., Next observation carried backward
- Linear Interpolation
  - e.g., fill NA with incremental values in between
Filling with designated value
A heuristic approach for when you simply know (or assume) what the NA values should be. ifelse() is a useful technique.
# A tibble: 16 × 5
company year quarter revenue revenue_filled
<chr> <dbl> <dbl> <dbl> <dbl>
1 AAPL 2020 1 100 100
2 AAPL 2020 2 NA 100
3 AAPL 2020 3 110 110
4 AAPL 2020 4 NA 100
5 AAPL 2021 1 200 200
6 AAPL 2021 2 NA 100
7 AAPL 2021 3 NA 100
8 AAPL 2021 4 NA 100
9 TSLA 2020 1 NA 100
10 TSLA 2020 2 NA 100
11 TSLA 2020 3 NA 100
12 TSLA 2020 4 NA 100
13 TSLA 2021 1 210 210
14 TSLA 2021 2 220 220
15 TSLA 2021 3 NA 100
16 TSLA 2021 4 NA 100
Or you can use tidyr::replace_na() function.
# A tibble: 16 × 4
company year quarter revenue
<chr> <dbl> <dbl> <dbl>
1 AAPL 2020 1 100
2 AAPL 2020 2 100
3 AAPL 2020 3 110
4 AAPL 2020 4 100
5 AAPL 2021 1 200
6 AAPL 2021 2 100
7 AAPL 2021 3 100
8 AAPL 2021 4 100
9 TSLA 2020 1 100
10 TSLA 2020 2 100
11 TSLA 2020 3 100
12 TSLA 2020 4 100
13 TSLA 2021 1 210
14 TSLA 2021 2 220
15 TSLA 2021 3 100
16 TSLA 2021 4 100
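Both approaches can be sketched side by side on a small made-up table, filling with 100 as in the outputs above:

```r
library(dplyr)
library(tidyr)

rev_toy <- tibble(
  quarter = 1:4,
  revenue = c(100, NA, 110, NA)
)

filled <- rev_toy |>
  mutate(
    revenue_ifelse  = ifelse(is.na(revenue), 100, revenue),  # explicit condition
    revenue_replace = replace_na(revenue, 100)               # same result, less typing
  )
```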
Last observation carried forward
tidyr::fill() offers convenient filling options. Columns are chosen the same way as in select(), and .direction controls how values are filled:
- "down": fill downwards (LOCF)
- "up": fill upwards (NOCB)
- "downup": LOCF then NOCB
- "updown": NOCB then LOCF
When filling in direction, grouping and arranging is important.
If you use LOCF, below is the correct approach:
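A sketch of the arrange, group, then fill pattern on made-up, deliberately shuffled data:

```r
library(dplyr)
library(tidyr)

sales <- tibble(
  company = c("TSLA", "AAPL", "AAPL", "TSLA", "AAPL", "TSLA"),
  quarter = c(1, 2, 1, 2, 3, 3),
  revenue = c(NA, NA, 100, 210, 110, NA)
)

locf <- sales |>
  arrange(company, quarter) |>            # time order within each company
  group_by(company) |>                    # never carry values across companies
  fill(revenue, .direction = "down") |>   # LOCF
  ungroup()
```

TSLA's first quarter stays NA: there is no earlier TSLA observation to carry forward, and grouping prevents AAPL's last value from leaking into it.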
Linear Interpolation
Interpolation is when you want to estimate a value between two known points. The approx() function is a handy tool.
By default, it returns estimates at 50 equally spaced points spanning the range of x.
[1] 5
$x
[1] 1.000000 1.081633 1.163265 1.244898 1.326531 1.408163 1.489796 1.571429
[9] 1.653061 1.734694 1.816327 1.897959 1.979592 2.061224 2.142857 2.224490
[17] 2.306122 2.387755 2.469388 2.551020 2.632653 2.714286 2.795918 2.877551
[25] 2.959184 3.040816 3.122449 3.204082 3.285714 3.367347 3.448980 3.530612
[33] 3.612245 3.693878 3.775510 3.857143 3.938776 4.020408 4.102041 4.183673
[41] 4.265306 4.346939 4.428571 4.510204 4.591837 4.673469 4.755102 4.836735
[49] 4.918367 5.000000
$y
[1] 0.0100000000 0.0091836735 0.0083673469 0.0075510204 0.0067346939
[6] 0.0059183673 0.0051020408 0.0042857143 0.0034693878 0.0026530612
[11] 0.0018367347 0.0010204082 0.0002040816 -0.0006122449 -0.0014285714
[16] -0.0022448980 -0.0030612245 -0.0038775510 -0.0046938776 -0.0055102041
[21] -0.0063265306 -0.0071428571 -0.0079591837 -0.0087755102 -0.0095918367
[26] -0.0087755102 -0.0063265306 -0.0038775510 -0.0014285714 0.0010204082
[31] 0.0034693878 0.0059183673 0.0083673469 0.0108163265 0.0132653061
[36] 0.0157142857 0.0181632653 0.0206122449 0.0230612245 0.0255102041
[41] 0.0279591837 0.0304081633 0.0328571429 0.0353061224 0.0377551020
[46] 0.0402040816 0.0426530612 0.0451020408 0.0475510204 0.0500000000
You can get only certain observations with xout argument. Notice it generates a list output with x and y.
To pull the interpolated results, access y from the result.
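A sketch of these steps with base R. The input vector is inferred from the output printed above (known values at positions 1, 3, and 5), so treat it as a reconstruction rather than the original code.

```r
returns <- c(0.01, NA, -0.01, NA, 0.05)
positions <- seq_along(returns)           # 1 2 3 4 5

# Ask only for the original positions instead of the default 50 points
result <- approx(x = positions, y = returns, xout = positions)
str(result)                               # a list with elements x and y

returns_interpolated <- result$y          # NA gaps replaced by straight-line estimates
returns_interpolated
```

Each gap is filled with the midpoint of its neighbors because the positions are evenly spaced.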
It is easy to visualize the results from linear approximation:
If you have values to specify for x-axis to calculate slope:
Treasury Yield Interpolation
For example, fill the straight line estimate for “9 month” yield.
The treasury daily yield data looks like below.
# A tibble: 4 × 4
date x6_mo x1_yr x2_yr
<date> <dbl> <dbl> <dbl>
1 2025-04-01 4.23 4.01 3.87
2 2025-04-02 4.24 4.04 3.91
3 2025-04-03 4.2 3.92 3.71
4 2025-04-04 4.14 3.86 3.68
To interpolate, you’ll need to pivot the data and make an explicit missing value:
# A tibble: 16 × 3
date name value
<date> <fct> <dbl>
1 2025-04-01 x6_mo 4.23
2 2025-04-01 x9_mo NA
3 2025-04-01 x1_yr 4.01
4 2025-04-01 x2_yr 3.87
5 2025-04-02 x6_mo 4.24
6 2025-04-02 x9_mo NA
7 2025-04-02 x1_yr 4.04
8 2025-04-02 x2_yr 3.91
9 2025-04-03 x6_mo 4.2
10 2025-04-03 x9_mo NA
11 2025-04-03 x1_yr 3.92
12 2025-04-03 x2_yr 3.71
13 2025-04-04 x6_mo 4.14
14 2025-04-04 x9_mo NA
15 2025-04-04 x1_yr 3.86
16 2025-04-04 x2_yr 3.68
Then, generate a numeric column to help interpolating the yield estimates.
# A tibble: 6 × 4
date name value days
<date> <fct> <dbl> <dbl>
1 2025-04-01 x6_mo 4.23 180
2 2025-04-01 x9_mo NA 270
3 2025-04-01 x1_yr 4.01 360
4 2025-04-01 x2_yr 3.87 720
5 2025-04-02 x6_mo 4.24 180
6 2025-04-02 x9_mo NA 270
Finally, interpolate with approx() function. Notice the use of group_by() in this operation.
# A tibble: 16 × 5
# Groups: date [4]
date name value days value_interpolated
<date> <fct> <dbl> <dbl> <dbl>
1 2025-04-01 x6_mo 4.23 180 4.23
2 2025-04-01 x9_mo NA 270 4.12
3 2025-04-01 x1_yr 4.01 360 4.01
4 2025-04-01 x2_yr 3.87 720 3.87
5 2025-04-02 x6_mo 4.24 180 4.24
6 2025-04-02 x9_mo NA 270 4.14
7 2025-04-02 x1_yr 4.04 360 4.04
8 2025-04-02 x2_yr 3.91 720 3.91
9 2025-04-03 x6_mo 4.2 180 4.2
10 2025-04-03 x9_mo NA 270 4.06
11 2025-04-03 x1_yr 3.92 360 3.92
12 2025-04-03 x2_yr 3.71 720 3.71
13 2025-04-04 x6_mo 4.14 180 4.14
14 2025-04-04 x9_mo NA 270 4
15 2025-04-04 x1_yr 3.86 360 3.86
16 2025-04-04 x2_yr 3.68 720 3.68
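One way this grouped interpolation could be written is sketched below. The tibble is a small hand-typed subset of the table above, and the name yields is an assumption; the original code may differ.

```r
library(dplyr)
library(tibble)

# Hand-typed subset of the long-format yield data (two dates only)
yields <- tribble(
  ~date,        ~name,   ~value, ~days,
  "2025-04-01", "x6_mo", 4.23,   180,
  "2025-04-01", "x9_mo", NA,     270,
  "2025-04-01", "x1_yr", 4.01,   360,
  "2025-04-02", "x6_mo", 4.24,   180,
  "2025-04-02", "x9_mo", NA,     270,
  "2025-04-02", "x1_yr", 4.04,   360
)

# approx() drops the NA pair, fits a line through the known points,
# and evaluates it at every value of days within each date
interpolated <- yields |>
  group_by(date) |>
  mutate(value_interpolated = approx(x = days, y = value, xout = days)$y) |>
  ungroup()

interpolated
```

Without group_by(), approx() would mix yields from different dates into a single curve.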
Exercise
- Use ifelse() to create a new column return_filled where missing returns are filled with 0, assuming no change in stock price on those days.
- Use tidyr::replace_na() to achieve the same result, replacing NA values with 0 in the return column.
- Use tidyr::complete() to add the missing quarters for 2020 and 2021. Assume that each year should have quarters 1 to 4 (Q1, Q2, Q3, Q4). The missing revenue values should appear as NA.
- Fill the missing revenue values using the LOCF method. Ensure the data is properly arranged by year and quarter before applying tidyr::fill().
- Use the approx() function to interpolate the yield for “9 Mo” based on the days and yield columns. Provide the interpolated yield value as your answer.
Character Vectors
Strings
Characters (Strings) store text information in finance such as
- earnings announcements
- analyst opinions
- descriptions
- investment sentiments, etc.
We’ll mostly use the stringr package (included in tidyverse).
Generate Strings
You can create strings by wrapping values in single quotes (') or double quotes (").
Print Strings
There are multiple string printing functions in R: print(), cat() and str_view()
- print() gives you the full structure of the underlying string
- cat() shows the rendered string output
- str_view() shows the rendered output (robust)
Escape Strings
Special characters (quotes, backslashes, etc.) have reserved uses. To include them literally in a string, escape them with a backslash \.
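A short sketch of escaping in practice (the example strings are made up for illustration):

```r
double_quoted <- "He said \"buy\""  # escape double quotes inside double quotes
single_quoted <- 'He said "buy"'    # or wrap the string in single quotes instead
path <- "C:\\data"                  # two backslashes in code, one in the string

identical(double_quoted, single_quoted)  # TRUE: both hold the same text
nchar(path)                              # 7 characters: C : \ d a t a
cat(path, "\n")                          # cat() shows the rendered string
```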
Other Special Characters
There are some other special characters worth remembering:
- \n newline
- \t tab
- \U Unicode escapes
Tricky Escapes
Creating a string with multiple quotes and backslashes gets confusing quickly! For example:
Without raw strings,
Double backslashes \\ and
double double quotes '""' with quotes
will make you crazy.
This is called Leaning Toothpick Syndrome
Raw Strings
To eliminate escaping, you can use a raw string with r"()", r"{}", or r"[]".
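A small comparison, assuming R 4.0 or later (where raw strings are available); the text is illustrative:

```r
escaped <- "say \"hi\" \\ bye"   # regular string: quotes and backslash escaped
raw     <- r"(say "hi" \ bye)"   # raw string: typed exactly as it should appear

identical(escaped, raw)  # TRUE
cat(raw, "\n")
```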
Exercise
Create strings that contain the following values:
- He said “We have a beautiful announcement today!”
- \a\b\c\d
- \\\
Creating strings
str_c() concatenates multiple string vectors, element-wise.
For example, combine a financial report header with
glue strings
str_glue() improves readability by allowing embedded expressions within {}.
Earnings: Microsoft Inc. reported strong results.
Also works with vectorized operations with recycling.
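A sketch of both functions with a vector input. The names company and quarter are hypothetical; the length-one quarter is recycled across companies.

```r
library(stringr)

company <- c("Apple", "Microsoft")  # hypothetical values
quarter <- "Q1"                     # length 1, recycled

combined <- str_c(company, " ", quarter)
glued <- str_glue("{company} reported {quarter} earnings.")

combined  # "Apple Q1" "Microsoft Q1"
glued     # one sentence per company
```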
flatten strings
If you want to collapse a vector of strings into a single string, use str_flatten(), or paste() with collapse.
[1] "Therewillbeastrongmarketvolatility."
[1] "There will be a strong market volatility."
Base R: paste() and collapse.
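The outputs above can be reproduced with a sketch like this (the vector name words_vec is an assumption):

```r
library(stringr)

words_vec <- c("There", "will", "be", "a", "strong", "market", "volatility.")

flat_tight  <- str_flatten(words_vec)                  # no separator
flat_spaced <- str_flatten(words_vec, collapse = " ")  # space between elements
flat_base   <- paste(words_vec, collapse = " ")        # base R equivalent

flat_tight
flat_spaced
```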
Exercise
- Check the length of the vector and the length of each string of:
- Flatten the above companies character vector into a scalar string.
- Fix the code below to evaluate the embedded expression companies, then print:
Letters in Strings
Two relevant concepts related to the length:
- the number of elements in a vector: length()
- the number of characters in each element: str_length() or nchar()
Subsetting Letters
You can extract parts of a string using position arguments with str_sub().
[1] "Ap" "Sa" "Ze"
[1] "." "C" "s"
[1] "e Inc." "son LLC" " Investments"
[1] " Inc." "n LLC" "ments"
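A sketch of the four kinds of calls. The company names are hypothetical (the original vector is not shown); negative positions count from the end of the string.

```r
library(stringr)

companies <- c("Apple Inc.", "Sampson LLC", "Zeta Investments")  # hypothetical

first_two <- str_sub(companies, 1, 2)    # characters 1 to 2
last_one  <- str_sub(companies, -1, -1)  # last character only
from_5th  <- str_sub(companies, 5)       # 5th character to the end
last_five <- str_sub(companies, -5)      # last five characters

first_two
last_five
```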
Pad strings
str_pad() pads a string to fixed length by adding extra whitespace on the left, right or both.
[1] " Apple" " Microsoft"
[1] "Apple " "Microsoft "
[1] " Apple " "Microsoft "
You can pad other strings, for example, leading zeros:
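For instance, a sketch with made-up ID values:

```r
library(stringr)

ids <- c("1", "23", "456")  # illustrative values

padded <- str_pad(ids, width = 4, side = "left", pad = "0")
padded  # "0001" "0023" "0456"
```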
Lettercases
Upper / lowercase transformations:
Exercises
- Extract the first two characters from companies.
- Extract the last two characters.
- Transform to lowercase and uppercase.
- Pad “0” to the left so that it has 4 letters (e.g., “0001”).
Regular Expressions
RegEx
Regular Expressions (Regex) is a language for describing “patterns” within strings.
- Regex is a core tool for working with text data
- Widely supported in stringr, tidyverse, and base R
Prep
We’ll use regular expression functions from the stringr and tidyr packages, both core members of the tidyverse.
Datasets and Examples
To explore regular expressions, we’ll use:
Three character vectors from the stringr package:
- fruit: names of 80 fruits
- words: 980 common English words
- sentences: 720 short example sentences
These built-in datasets are great for testing regex.
Pattern Basics
str_view() highlights matches in a string vector using <>.
Literal characters match exactly:
[6] │ bil<berry>
[7] │ black<berry>
[10] │ blue<berry>
[11] │ boysen<berry>
[19] │ cloud<berry>
[21] │ cran<berry>
[29] │ elder<berry>
[32] │ goji <berry>
[33] │ goose<berry>
[38] │ huckle<berry>
[50] │ mul<berry>
[70] │ rasp<berry>
[73] │ salal <berry>
[76] │ straw<berry>
Metacharacters and Wildcards
Some characters, like ., +, and *, have special meanings in regex and are known as metacharacters.
.: A wildcard that matches any single character. For example:
Pattern Length with Wildcards
You can match specific lengths of text using . repeated:
[1] │ <apple>
[7] │ bl<ackbe>rry
[48] │ mand<arine>
[51] │ nect<arine>
[62] │ pine<apple>
[64] │ pomegr<anate>
[70] │ r<aspbe>rry
[73] │ sal<al be>rry
This matches an “a” followed by any three characters and an “e”.
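A sketch on a few hand-picked strings (not the full fruit vector), using str_extract() so the matched text is visible:

```r
library(stringr)

x <- c("apple", "mandarine", "ace")

# "a", then any three characters, then "e"
matched <- str_extract(x, "a...e")
matched  # "apple" "arine" NA  ("ace" is too short to match)
```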
Quantifiers
Quantifiers control how often a pattern appears:
?: 0 or 1 time (optional)
[1] │ <a>
[2] │ <ab>
[3] │ <ab>b
[4] │ <ab>bb
[5] │ <ab>c
+: 1 or more times
[2] │ <ab>
[3] │ <abb>
[4] │ <abbb>
[5] │ <ab>c
*: 0 or more times
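The three quantifiers side by side. The input vector is an assumption inferred from the outputs above.

```r
library(stringr)

x <- c("a", "ab", "abb", "abbb", "abc")

optional_b    <- str_extract(x, "ab?")  # "a" then zero or one "b"
one_or_more_b <- str_extract(x, "ab+")  # "a" then at least one "b"
any_number_b  <- str_extract(x, "ab*")  # "a" then any number of "b"

optional_b     # "a" "ab" "ab" "ab" "ab"
one_or_more_b  # NA  "ab" "abb" "abbb" "ab"
any_number_b   # "a" "ab" "abb" "abbb" "ab"
```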
Exercise
- Use str_view() to highlight the pattern “ca”.
- Use str_view() to highlight the pattern “ca” followed by exactly one character (hint: .).
Character Set
Use brackets [] to define and match sets of characters. This is also called a character class. For example, [aeiou] matches any vowel.
[284] │ <exa>ct
[285] │ <exa>mple
[288] │ <exe>rcise
[289] │ <exi>st
[836] │ <sys>tem
[901] │ <typ>e
The caret ^ inside brackets negates the set.
The caret ^ outside of brackets has a different meaning: it anchors the match to the beginning of the string.
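A sketch of the three uses on a few sample words (the patterns are guesses consistent with the outputs above):

```r
library(stringr)

x <- c("exact", "system", "apple")

x_between_vowels <- str_extract(x, "[aeiou]x[aeiou]")    # set: any vowel
y_between_cons   <- str_extract(x, "[^aeiou]y[^aeiou]")  # ^ inside: anything but a vowel
starts_with_a    <- str_detect(x, "^a")                  # ^ outside: start of string

x_between_vowels  # "exa" NA NA
y_between_cons    # NA "sys" NA
starts_with_a     # FALSE FALSE TRUE
```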
Alternation (OR)
Use | to match one of several patterns:
[1] │ <apple>
[13] │ canary <melon>
[20] │ coco<nut>
[52] │ <nut>
[62] │ pine<apple>
[72] │ rock <melon>
[80] │ water<melon>
[9] │ bl<oo>d orange
[33] │ g<oo>seberry
[47] │ lych<ee>
[66] │ purple mangost<ee>n
This finds fruits containing specified keywords or repeated vowels.
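The two patterns can be sketched as follows, applied to a small hand-picked vector rather than the full fruit data:

```r
library(stringr)

x <- c("pineapple", "coconut", "cherry", "lychee")

has_keyword      <- str_detect(x, "apple|melon|nut")  # any of three keywords
has_double_vowel <- str_detect(x, "aa|ee|ii|oo|uu")   # any repeated vowel

has_keyword       # TRUE TRUE FALSE FALSE
has_double_vowel  # FALSE FALSE FALSE TRUE
```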
Exercise
- Use str_view() to highlight the pattern where “x” is surrounded by vowels.
- Use str_view() to match words containing either “flux” or “pixie”.
Key functions
str_view() is good to experiment on pattern matching. Other key functions are:
- str_detect(): logical check if a pattern exists
- str_subset(): subset elements that contain the pattern
- str_count(): count the occurrences of a pattern
- str_replace(): replace patterns
- separate_...(): separate by patterns
Use case: Detect Matches
In real data, you can use str_detect() to check for the presence of a pattern.
- It returns logical vector; ideal for filtering
str_subset() and str_which()
Two other useful functions are:
- str_subset(): returns the elements that contain the pattern
- str_which(): returns the numeric index of the elements that contain the pattern
Example:
[1] "These days a chicken leg is a rare dish."
[2] "Rice is often served in round bowls."
[3] "A large size in stockings is hard to sell."
[4] "A rod is used to catch pink salmon."
[5] "The source of the huge river is the clear spring."
[6] "The fish twisted and turned on the bent hook."
[1] 4 5 10 12 13 22
You can use these to extract or locate matches without altering the original data structure.
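The three functions compared on a few made-up headlines (hypothetical data, not the sentences vector):

```r
library(stringr)

headlines <- c("Fed signals rate cut", "Bank earnings beat estimates", "rate hike fears return")

detected <- str_detect(headlines, "rate")  # logical vector, one value per element
subset   <- str_subset(headlines, "rate")  # the matching elements themselves
indices  <- str_which(headlines, "rate")   # integer positions of the matches

detected  # TRUE FALSE TRUE
indices   # 1 3
```

The logical vector from str_detect() is the form that fits inside dplyr::filter().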
Count Matches with str_count()
Check for repeated sequences:
Count the number of matches per string:
Case Sensitivity in Regex
Sometimes your results may look off. For example, the name “Aaban” has three “a”s, but only two are counted. That’s because regex is case sensitive by default.
You can fix this in three ways:
- Add uppercase characters to the pattern:
- Use regex(..., ignore_case = TRUE):
- Preprocess the string to lowercase:
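The three fixes can be sketched on the single name from the example:

```r
library(stringr)

name <- "Aaban"

default_count <- str_count(name, "a")                            # case sensitive
with_upper    <- str_count(name, "[aA]")                         # fix 1: add uppercase
with_flag     <- str_count(name, regex("a", ignore_case = TRUE)) # fix 2: regex() flag
with_lower    <- str_count(str_to_lower(name), "a")              # fix 3: lowercase first

default_count  # 2
with_upper     # 3
```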
Replace Values
str_replace() replaces the first match. str_replace_all() replaces all matches.
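A quick illustration of the difference:

```r
library(stringr)

first_only <- str_replace("banana", "a", "o")      # only the first "a" changes
all_of_them <- str_replace_all("banana", "a", "o") # every "a" changes

first_only   # "bonana"
all_of_them  # "bonono"
```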
Extract and Separate Variables
In a tibble (data frame), you can separate text into variables by
- delimiter
- position
- pattern (regex)
Separate by delimiter
separate_longer_delim() separates values into long form.
# A tibble: 6 × 1
x
<chr>
1 a
2 b
3 c
4 d
5 e
6 f
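A sketch that would give the output above, assuming tidyr 1.3.0 or later and an input tibble named df (the name is an assumption):

```r
library(tibble)
library(tidyr)

df <- tibble(x = c("a,b,c", "d,e", "f"))

# Each delimited piece becomes its own row
long <- df |> separate_longer_delim(x, delim = ",")
long
```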
separate_wider_delim() separates values into wide form. You must specify names, and actions if too few or too many.
# A tibble: 3 × 3
first second third
<chr> <chr> <chr>
1 a b c
2 d e <NA>
3 f <NA> <NA>
Warning: Debug mode activated: adding variables `x_ok`, `x_pieces`, and
`x_remainder`.
# A tibble: 3 × 6
first second x x_ok x_pieces x_remainder
<chr> <chr> <chr> <lgl> <int> <chr>
1 a b a,b,c FALSE 3 ",c"
2 d e d,e TRUE 2 ""
3 f <NA> f TRUE 1 ""
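A sketch of the too_few and too_many arguments, assuming tidyr 1.3.0 or later. The first call corresponds to the first table above; the second uses "drop" instead of the "debug" option shown in the warning.

```r
library(tibble)
library(tidyr)

df <- tibble(x = c("a,b,c", "d,e", "f"))

# Fewer pieces than names: fill from the left and pad with NA
wide <- df |>
  separate_wider_delim(x, delim = ",",
                       names = c("first", "second", "third"),
                       too_few = "align_start")

# More pieces than names: silently drop the extras
wide_two <- df |>
  separate_wider_delim(x, delim = ",",
                       names = c("first", "second"),
                       too_few = "align_start",
                       too_many = "drop")
wide
wide_two
```

Using "debug" for either argument keeps the original column and adds diagnostic columns, which helps when the split fails.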
Separate by position
separate_longer_position() splits by fixed width. Must specify width.
# A tibble: 5 × 1
x
<chr>
1 12
2 11
3 13
4 1
5 21
separate_wider_position() separates values into wide form.
You must specify widths with named integer vector, and actions if too few or too many.
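Both position-based functions are sketched below, assuming tidyr 1.3.0 or later. The first input is inferred from the output above; the year-quarter codes are hypothetical.

```r
library(tibble)
library(tidyr)

# Long form: cut each string into chunks of width 2
long <- tibble(x = c("1211", "131", "21")) |>
  separate_longer_position(x, width = 2)

# Wide form: a named integer vector gives column names and widths
wide <- tibble(x = c("2024Q1", "2025Q2")) |>
  separate_wider_position(x, widths = c(year = 4, quarter = 2))

long
wide
```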
Separate by regex
You can also separate by regex patterns. Below is a more complex sample:
# A tibble: 4 × 1
string
<chr>
1 <Sheryl>-F_34
2 <Kisha>-F_45
3 <Brandon>-M_33
4 <Sharon>-F_38
Use separate_wider_regex() to extract structured data:
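One possible call for the sample above, assuming tidyr 1.3.0 or later. Named pattern pieces become columns and unnamed pieces are matched but dropped; the column names name, gender, and age are assumptions.

```r
library(tibble)
library(tidyr)

df <- tibble(string = c("<Sheryl>-F_34", "<Kisha>-F_45", "<Brandon>-M_33", "<Sharon>-F_38"))

people <- df |>
  separate_wider_regex(
    string,
    patterns = c(
      "<",                 # literal, dropped
      name = "[A-Za-z]+",  # kept as column `name`
      ">-",                # literal, dropped
      gender = ".",        # single character
      "_",                 # literal, dropped
      age = "[0-9]+"       # digits, still stored as character
    )
  )
people
```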
Exercise
- str_detect() to indicate whether each string contains a digit (hint: [0-9]).
- str_count() to count the number of vowels.
- str_replace_all() to replace “a” with “e”.
- separate_wider to separate fruit into two variables by whitespace.
- Generate a tibble from fruit first.
Escaping
To literally match metacharacters (., ?, *) in regex, use \.
- To literally match ., the regex pattern should be \.
- Regex patterns are given in strings
- However, strings also escape \ with \
- You should use "\\." to express \.
[1] │ \.
[2] │ <a.c>
To match ?, you need the regex \?, which is written as the string "\\?".
To match \, you need the regex \\, which is written as the string "\\\\".
If you use raw strings in regex, it reduces one level of escaping.
[1] │ Is this crazy<?>
[1] │ <\>
Or you can escape with character set [] for some (not all) metacharacters.
- Still, \ cannot be escaped this way with a character set
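The escaping options compared on small examples (raw strings assume R 4.0 or later):

```r
library(stringr)

x <- c("abc", "a.c")

unescaped   <- str_detect(x, "a.c")       # . is a wildcard: matches both
with_escape <- str_detect(x, "a\\.c")     # string "\\." is the regex \.
with_raw    <- str_detect(x, r"(a\.c)")   # raw string: one level of escaping less
with_set    <- str_detect(x, "a[.]c")     # character set makes . literal

# A single backslash needs four backslashes, or two in a raw string
backslash <- "\\"
match_plain <- str_detect(backslash, "\\\\")
match_raw   <- str_detect(backslash, r"(\\)")

with_escape  # FALSE TRUE
```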
Anchors
If you want to match at the start or end, you need anchors:
- ^ matches the start
- $ matches the end
- \b matches the boundary between words
[1] │ <sum>mary(x)
[2] │ <sum>marize(df)
[4] │ <sum>(x)
Word boundary example:
[4] │ <sum>(x)
Match end:
Anchors also match zero-width if used alone:
[1] │ <>apple banana monkey boom
[2] │ apple banana monkey boom<>
[3] │ <>apple<> <>banana<> <>monkey<> <>boom<>
You can use this feature for replacements:
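A sketch of anchors for matching and for zero-width replacement. The input vector is inferred from the outputs above and is an assumption.

```r
library(stringr)

x <- c("summary(x)", "summarize(df)", "rowsum(x)", "sum(x)")

at_start   <- str_detect(x, "^sum")        # "sum" at the start of the string
whole_word <- str_detect(x, "\\bsum\\b")   # "sum" as a standalone word

# Anchors match a position, so replacing them inserts text there
prefixed <- str_replace("abc", "^", "--")
suffixed <- str_replace("abc", "$", "--")

at_start    # TRUE TRUE FALSE TRUE
whole_word  # FALSE FALSE FALSE TRUE
prefixed    # "--abc"
suffixed    # "abc--"
```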
Character Set
Use [] to match any character from a set.
- e.g., [abc]
- [^abc] to exclude
- [a-z]: - defines a range
- \ escapes special characters within []
[1] │ <abc>d ABCD 12345 -!@#%.
[1] │ <abcd> ABCD 12345 -!@#%.
[1] │ abcd< ABCD >12345< -!@#%.>
[1] │ <a>-<b>-<c>
[1] │ <a><->b<-><c>
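Patterns that would produce the matches above are sketched here (they are reconstructions, so treat them as assumptions); str_extract_all() returns the matched pieces:

```r
library(stringr)

x <- "abcd ABCD 12345 -!@#%."

set_abc   <- str_extract_all(x, "[abc]+")[[1]]      # runs of a, b, or c
lower_run <- str_extract_all(x, "[a-z]+")[[1]]      # range a to z
negated   <- str_extract_all(x, "[^a-z0-9]+")[[1]]  # anything but lowercase or digits

range_ac   <- str_extract_all("a-b-c", "[a-c]")[[1]]    # - defines a range
literal_ac <- str_extract_all("a-b-c", "[a\\-c]")[[1]]  # \- is a literal dash

set_abc     # "abc"
range_ac    # "a" "b" "c"
literal_ac  # "a" "-" "-" "c"
```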
Character Set Shortcuts
Some character sets are so common that they have shortcuts:
- \d any digit
- \D anything that is not a digit
- \s any whitespace (space, tab, newline)
- \S anything that is not whitespace
- \w any word character (letters, digits, and underscore)
- \W any non-word character
[1] │ abcd ABCD <12345> -!@#%.
[1] │ <abcd ABCD >12345< -!@#%.>
[1] │ abcd< >ABCD< >12345< >-!@#%.
[1] │ <abcd> <ABCD> <12345> <-!@#%.>
[1] │ <abcd> <ABCD> <12345> -!@#%.
[1] │ abcd< >ABCD< >12345< -!@#%.>
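The shortcuts applied to the same test string (the patterns with + are assumptions consistent with the outputs above):

```r
library(stringr)

x <- "abcd ABCD 12345 -!@#%."

digits     <- str_extract_all(x, "\\d+")[[1]]
non_digits <- str_extract_all(x, "\\D+")[[1]]
spaces     <- str_extract_all(x, "\\s+")[[1]]
non_spaces <- str_extract_all(x, "\\S+")[[1]]
word_runs  <- str_extract_all(x, "\\w+")[[1]]
non_words  <- str_extract_all(x, "\\W+")[[1]]

digits     # "12345"
word_runs  # "abcd" "ABCD" "12345"
```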
Quantifiers
On top of
- ? (0 or 1)
- + (1 or more)
- * (0 or more)
You can specify precise quantifiers:
- {n} exactly n times
- {n,} at least n times
- {n,m} between n and m times
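A sketch of the precise quantifiers on runs of the letter "a":

```r
library(stringr)

x <- c("a", "aa", "aaa", "aaaa")

exactly_two  <- str_extract(x, "a{2}")    # exactly 2
at_least_two <- str_extract(x, "a{2,}")   # 2 or more
two_to_three <- str_extract(x, "a{2,3}")  # between 2 and 3

exactly_two   # NA "aa" "aa" "aa"
at_least_two  # NA "aa" "aaa" "aaaa"
two_to_three  # NA "aa" "aaa" "aaa"
```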
Operator Precedence
Regular expressions follow precedence rules like math:
- Quantifiers (+, ?): high
- Alternation (|): low
You can use () to specify precedence and grouping.
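Two small cases where parentheses change the meaning (the inputs are illustrative):

```r
library(stringr)

# Quantifier binds tightly: + applies only to "b" unless grouped
b_repeated  <- str_extract("ababab", "ab+")    # "ab"
ab_repeated <- str_extract("ababab", "(ab)+")  # "ababab"

# Alternation binds loosely: "^a|b$" means "starts with a" OR "ends with b"
x <- c("a", "b", "ab", "xb")
loose <- str_detect(x, "^a|b$")    # TRUE TRUE TRUE TRUE
tight <- str_detect(x, "^(a|b)$")  # TRUE TRUE FALSE FALSE
```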
Grouping
Parentheses () can also be used for capturing groups.
Use \1, \2, etc., to refer back to matched groups.
[4] │ b<anan>a
[20] │ <coco>nut
[22] │ <cucu>mber
[41] │ <juju>be
[56] │ <papa>ya
[73] │ s<alal> berry
[152] │ <church>
[217] │ <decide>
[617] │ <photograph>
[699] │ <require>
[739] │ <sense>
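Patterns consistent with the two outputs above are sketched here on a few sample words (the patterns are reconstructions, so treat them as assumptions):

```r
library(stringr)

# A pair of characters immediately repeated
repeated_pair <- str_detect(c("banana", "coconut", "apple"), "(..)\\1")

# Words that start and end with the same two characters
same_ends <- str_detect(c("church", "sense", "house"), "^(..).*\\1$")

repeated_pair  # TRUE TRUE FALSE
same_ends      # TRUE TRUE FALSE
```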
Use back references to replace:
# A tibble: 3 × 2
value new
<chr> <chr>
1 The birch canoe slid on the smooth planks. The canoe birch slid on the smoot…
2 Glue the sheet to the dark blue background. Glue sheet the to the dark blue b…
3 It's easy to tell the depth of a well. It's to easy tell the depth of a …
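A sketch of a replacement that swaps the second and third words, consistent with the table above (the exact original call is an assumption):

```r
library(stringr)

sentence <- "The birch canoe slid on the smooth planks."

# Capture three words, then write them back in the order 1, 3, 2
swapped <- str_replace(sentence, "(\\w+) (\\w+) (\\w+)", "\\1 \\3 \\2")
swapped  # "The canoe birch slid on the smooth planks."
```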
You can extract matches with str_match(), which returns a matrix:
[,1] [,2] [,3]
[1,] "the smooth planks" "smooth" "planks"
[2,] "the sheet to" "sheet" "to"
[3,] "the depth of" "depth" "of"
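A sketch of str_match() on two short strings. The pattern is an assumption consistent with the matrix above: column 1 holds the full match and the later columns hold each capturing group.

```r
library(stringr)

x <- c("slid on the smooth planks.", "no match here")

m <- str_match(x, "the (\\w+) (\\w+)")
m
# Row 1: "the smooth planks" "smooth" "planks"
# Row 2: NA NA NA (no match)
```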
Convert to tibble:
# A tibble: 3 × 3
match word1 word2
<chr> <chr> <chr>
1 the smooth planks smooth planks
2 the sheet to sheet to
3 the depth of depth of
Or use separate_wider_regex()
# A tibble: 3 × 2
word1 word2
<chr> <chr>
1 smooth planks
2 dark blue
3 depth of
When you want to use () purely for grouping, not for capturing:
- (?:) is a non-capturing group
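The difference shows up in the number of columns that str_match() returns:

```r
library(stringr)

x <- c("a gray cat", "a grey dog")

capturing     <- str_match(x, "gr(e|a)y")    # full match plus one group
non_capturing <- str_match(x, "gr(?:e|a)y")  # full match only

ncol(capturing)      # 2
ncol(non_capturing)  # 1
```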
Exercise
- How would you match the literal string "'\? How about "$^$"?
- Given the corpus of common words in stringr::words, create regular expressions that find all words that:
- Start with “y”.
- Don’t start with “y”.
- End with “x”.
- Are exactly three letters long. (Don’t cheat by using str_length()!)
- Have seven letters or more.
- Contain a vowel-consonant pair.
- Contain at least two vowel-consonant pairs in a row.
- Only consist of repeated vowel-consonant pairs.
- Switch the first and last letters in words. Which of those strings are still words?
- Describe in words what these regular expressions match. Read carefully to see if each entry is a regular expression or a string that defines a regular expression.
- ^.*$
- "\\{.+\\}"
- \d{4}-\d{2}-\d{2}
- "\\\\{4}"
- \..\..\..
- (.)\1\1
- "(..)\\1"
Pattern Control
regex() gives more control over the pattern object, using flags.
[1] │ <banana>
[1] │ <banana>
[2] │ <Banana>
[3] │ <BANANA>
Dotall flag allows . to match all, including \n.
[1] │ Line 1<
│ Line> 2<
│ Line> 3
Multiline makes ^ and $ match start and end of each line.
[1] │ <Line> 1
│ Line 2
│ Line 3
[1] │ <Line> 1
│ <Line> 2
│ <Line> 3
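Both flags are sketched on the same three-line string (the input is inferred from the outputs above):

```r
library(stringr)

x <- "Line 1\nLine 2\nLine 3"

# dotall: by default . does not match a newline
default_dot <- str_detect(x, ".Line")
dotall_dot  <- str_detect(x, regex(".Line", dotall = TRUE))

# multiline: by default ^ only matches the start of the whole string
default_start   <- str_count(x, "^Line")
multiline_start <- str_count(x, regex("^Line", multiline = TRUE))

default_dot      # FALSE
dotall_dot       # TRUE
default_start    # 1
multiline_start  # 3
```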
The comments flag lets you annotate complex patterns with comments and whitespace.
phone <- regex(
  r"(
    \(?     # optional opening parens
    (\d{3}) # area code capturing group
    [)\-]?  # optional closing parens or dash
    \ ?     # optional space
    (\d{3}) # another three numbers group
    [\ -]?  # optional space or dash
    (\d{4}) # four digits group
  )",
  comments = TRUE
)
str_extract(c("514-791-8141", "(123) 456 7890", "123456"), phone)
[1] "514-791-8141" "(123) 456 7890" NA
Fixed matches
Opt out of regular expression rules with fixed(), which matches the pattern as literal text.
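A small comparison of regex and fixed matching:

```r
library(stringr)

x <- c("a.c", "abc")

as_regex <- str_detect(x, ".")         # . is a wildcard: matches any character
as_fixed <- str_detect(x, fixed("."))  # . is a literal period

as_regex  # TRUE TRUE
as_fixed  # TRUE FALSE
```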
Examples
- Find all sentences that start with “The”.
[1] │ <The> birch canoe slid on the smooth planks.
[4] │ <The>se days a chicken leg is a rare dish.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.
[11] │ <The> boy was there when the sun rose.
[1] │ <The> birch canoe slid on the smooth planks.
[6] │ <The> juice of lemons makes fine punch.
[7] │ <The> box was thrown beside the parked truck.
[8] │ <The> hogs were fed chopped corn and garbage.
[11] │ <The> boy was there when the sun rose.
[13] │ <The> source of the huge river is the clear spring.
- Find all sentences that begin with a pronoun.
[3] │ <It>'s easy to tell the depth of a well.
[15] │ <He>lp the woman get back to her feet.
[27] │ <He>r purse was full of useless trash.
[29] │ <It> snowed, rained, and hailed the same morning.
[63] │ <He> ran half way to the hardware store.
[90] │ <He> lay prone and hardly moved a limb.
[3] │ <It>'s easy to tell the depth of a well.
[29] │ <It> snowed, rained, and hailed the same morning.
[63] │ <He> ran half way to the hardware store.
[90] │ <He> lay prone and hardly moved a limb.
[116] │ <He> ordered peach pie with ice cream.
[127] │ <It> caught its hind paw in a rusty trap.
Best Practices
How do you spot such mistakes? Create a few positive and negative examples and test the pattern against them.
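A sketch of that testing habit for the pronoun example. The two patterns are assumptions about what produced the outputs above, and the test sentences are made up.

```r
library(stringr)

loose_pattern  <- "^She|He|It|They"          # ^ applies only to "She"
strict_pattern <- "^(She|He|It|They)\\b"     # grouped, with a word boundary

positive <- c("He ran home.", "It snowed.")                # should match
negative <- c("Help the woman.", "Why did They leave?")    # should not match

str_detect(positive, strict_pattern)  # TRUE TRUE
str_detect(negative, strict_pattern)  # FALSE FALSE
str_detect(negative, loose_pattern)   # TRUE TRUE: the mistake is exposed
```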
Create pattern with code
What if you wanted to find all sentences that mention a color?
[2] │ Glue the sheet to the dark <blue> background.
[26] │ Two <blue> fish swam in the tank.
[92] │ A wisp of cloud hung in the <blue> air.
[148] │ The spot on the blotter was made by <green> ink.
[160] │ The sofa cushion is <red> and of light weight.
[174] │ The sky that morning was clear and bright <blue>.
What if there are many colors and they are stored in data, like:
First you want to remove numbers from colors:
[1] "white" "aliceblue" "antiquewhite" "aquamarine"
[5] "azure" "beige" "bisque" "black"
[9] "blanchedalmond" "blue"
Now you can generate patterns using R code:
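A sketch of building the pattern from data. It assumes the color names come from base R's colors(); the variable names are hypothetical.

```r
library(stringr)

cols <- grDevices::colors()
cols <- cols[!str_detect(cols, "\\d")]  # drop numbered variants like "blue4"

# Join all names with | and wrap in word boundaries
pattern <- str_c("\\b(", str_flatten(cols, "|"), ")\\b")

str_detect("The sofa cushion is red.", pattern)  # TRUE
str_detect("The cat sat.", pattern)              # FALSE
```

The same pattern can be passed to str_view(sentences, pattern) to highlight colors in the example sentences.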
Application: Financial News
To fetch news data, you’ll need API key from
News data prep
The data.frame is stored in second level.
# A tibble: 6 × 9
author title description url url_to_image published_at content id
<chr> <chr> <chr> <chr> <chr> <dttm> <chr> <chr>
1 Waqas Bith… "A system … http… https://hac… 2026-02-07 20:08:59 "On 6 … <NA>
2 Oluwap… Trum… "The crypt… http… https://cry… 2026-02-07 20:05:46 "The c… <NA>
3 Editor Inte… "Podcast: … http… https://www… 2026-02-07 20:00:00 "Podca… <NA>
4 Diana … How … "By 2050, … http… https://www… 2026-02-07 20:00:00 "Young… <NA>
5 Editor… Mike… "Stablecoi… http… https://sta… 2026-02-07 19:35:52 "Stabl… <NA>
6 Gareth… As t… "As the We… http… https://liv… 2026-02-07 19:31:07 "Is Au… abc-…
# ℹ 1 more variable: name <chr>
Application: Financial News
To filter financial news that mention “uncertain”:
# A tibble: 0 × 3
# ℹ 3 variables: author <chr>, title <chr>, description <chr>
Filter news that mention “uncertain” or “risk” or “option” or “down”,
# A tibble: 12 × 3
author title description
<chr> <chr> <chr>
1 Editor Interview 1999 – Gold Rush as Dollar Cra… "Podcast: …
2 Kurt Zindulka Half of British Voters Want Prime Minist… "Half of B…
3 Glenn Carle FO Exclusive: Global Lightning Roundup o… "Editor-in…
4 Bloomberg News Charting the Global Economy: ECB Holds, … "The Europ…
5 Juliana Kim DVDs and public transit: Boycott drives … "A sweepin…
6 Rafael Nam Trump promised a crypto revolution. So w… "Trump got…
7 The White Coat Investor 13 Reasons I Still Own Bonds "For some …
8 Reuters Iran's surging crypto activity draws US … "Crypto us…
9 Jake Simmons Kevin Warsh Will Trigger Bitcoin Regime … "Bitcoin’s…
10 James Halver Mining Stocks And Asian Markets Hit As B… "Bitcoin’s…
11 Everygame Casino Super Bowl Betting Promos: Everygame's L… "Everygame…
12 Bovada Super Bowl Betting Sites: Bovada's Welco… "An inform…
You can make sentiment polarity with simple lexicon matching:
positive_words <- c("gain", "rally", "beat", "surge", "growth", "record", "optimism", "strong")
negative_words <- c("loss", "fall", "miss", "drop", "decline", "weak", "concern", "crisis")
news_frame |>
  mutate(
    text = str_to_lower(title),
    pos = str_count(text, str_c("\\b(", str_c(positive_words, collapse = "|"), ")\\b")),
    neg = str_count(text, str_c("\\b(", str_c(negative_words, collapse = "|"), ")\\b")),
    sentiment_score = pos - neg
  )
# A tibble: 96 × 13
author title description url url_to_image published_at content id
<chr> <chr> <chr> <chr> <chr> <dttm> <chr> <chr>
1 Waqas Bith… "A system … http… https://hac… 2026-02-07 20:08:59 "On 6 … <NA>
2 Oluwa… Trum… "The crypt… http… https://cry… 2026-02-07 20:05:46 "The c… <NA>
3 Editor Inte… "Podcast: … http… https://www… 2026-02-07 20:00:00 "Podca… <NA>
4 Diana… How … "By 2050, … http… https://www… 2026-02-07 20:00:00 "Young… <NA>
5 Edito… Mike… "Stablecoi… http… https://sta… 2026-02-07 19:35:52 "Stabl… <NA>
6 Garet… As t… "As the We… http… https://liv… 2026-02-07 19:31:07 "Is Au… abc-…
7 Quent… Afte… "The specu… http… https://s.y… 2026-02-07 19:30:00 "Galax… <NA>
8 Bloom… Tech… "The bigge… http… https://sma… 2026-02-07 19:29:26 "(Bloo… fina…
9 Joe W… Prof… "\"Sharp b… http… https://fut… 2026-02-07 19:15:00 "Follo… <NA>
10 Rankp… Rank… "KUALA LUM… http… https://ml.… 2026-02-07 19:11:00 "KUALA… <NA>
# ℹ 86 more rows
# ℹ 5 more variables: name <chr>, text <chr>, pos <int>, neg <int>,
# sentiment_score <int>
Exercises
- For each of the following challenges, try solving it by using both a single regular expression, and a combination of multiple str_detect() calls.
- Find all words that start or end with x.
- Find all words that start with a vowel and end with a consonant.
- Are there any words that contain at least one of each different vowel?
- colors() contains a number of modifiers like “light”, “dark”, “medium” as in “lightgray” and “darkblue”. How could you automatically identify these modifiers?
- Think about how you might detect and then remove the colors that are modified.
Large Language Models
LLMs in Finance
The use of LLMs in financial data analysis can be very effective.
- Financial news and transcripts summary
- Cleaning unstructured filings
- Sentiment, and Q&A
- Code assistant
LLM Deployments in R
Cloud LLMs
- ellmer supports cloud LLM backends
- Requires API key variables (e.g., OPENAI_API_KEY)
- Pros: Zero setup, scalable
- Cons: Costs, privacy risks
Local LLMs
- ollamar and mall packages
- ellmer also supports local models
- Pros: No data leakage, offline
- Cons: Hardware limits, setup
Package: ellmer
ellmer connects to cloud/local LLMs.
- Supports multiple providers
- OpenAI, Gemini, Claude, Groq, Ollama (local), DeepSeek, Perplexity, …
Prep: Google Gemini
Google Gemini provides a free-tier API.
Get the API key
Set the chat machine
- Test the chat machine:
Interactive Console
You can use it in interactive mode with live_console() or live_browser().
Test out yourselves:
Prompt Engineering
Core Principles for effective use of LLMs:
Give the Role & Context
Source Delimiters
Explicit Output Format
Determinism Controls
Principle 1: Role + Context
Define LLM’s role and financial context.
- Aligns responses with domain expertise.
Examples
- Specify role: “You are a sell-side equity analyst.”
- Add context: “Focus on tech stocks in a bullish market.”
Principle 2: Source Delimiters
Wrap input text (news, filings) in triple backticks to clarify the input boundaries.
Examples
- In prompt: “Analyze the news
news_frame$content[[1]]”
**Neutral with a cautious tone.**
Here's why:
* **Positive Indicators:** "signs of recovery," "industrial output
accelerating," "retail sales improving," "worst...may be over," "continued
recovery," "broadly encouraging."
* **Negative/Cautious Indicators:** "slower pace," "analysts caution,"
"recovery remains fragile and uneven," "exports still facing headwinds," "weak
global demand."
The article presents an initial set of positive economic data but immediately
follows it with significant caveats and challenges. It's not definitively
positive because of the strong warnings, nor is it negative because it details
real improvements. The overall sentiment is one of cautious optimism,
acknowledging progress while highlighting ongoing vulnerabilities, which makes
it lean towards neutral in its overall assessment of the situation's current
state.
Principle 3: Explicit Output Format
Request structured outputs for your analysis. When multiple answers are expected, JSON format is recommended.
Examples
- “Strictly answer Yes or No.”
- “Return JSON: {sentiment: value, confidence: 0-1}”
Principle 4: Determinism Controls
Set temperature = 0 to ensure consistent responses and limit token budget to control costs.
Temperature in an LLM is a parameter that controls randomness (0-2 range). A low level (0) gives consistent, predictable and rigid outputs. A higher level (e.g., 1) gives creative and varied responses.
Tokens are similar to word counts: they measure the amount of information in the input/output text.
Tune and Build the LLM machine
Let’s set up a financial news analyzer machine with prompt engineering, as in the example below.
news_analyzer <- chat_google_gemini(
  system_prompt = r"{
You are an expert financial analyst.
You will be provided news article title to analyze, which will be wrapped with triple backticks ```.
Your task is to assess the market sentiment of a news article.
Return valid JSON with curly braces without any other formatting:
– "score": a real number between [0, 1] (0 = extremely negative, 1 = extremely positive).
– "rationale": less than 25 words.
Do not add any keys, text, or commentary outside the JSON object.
}",
  # api_key = "Your_API_KEY",
  api_args = list(
    generationConfig = list(
      temperature = 0,
      maxOutputTokens = 100
    )
  )
)
Using model = "gemini-2.5-flash".
Prep: News data
Prepare news dataframe from newsanchor.
Run on single Document
As a test run:
Clean output with Regex
Since the output is always wrapped in a JSON Markdown code fence, we can clean it with regex.
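A sketch of the cleaning step on a made-up response string (not real model output). The fence is built with strrep() only to keep literal triple backticks out of the example.

```r
library(stringr)

fence <- strrep("`", 3)  # three backticks

# Hypothetical LLM response wrapped in a json code fence
raw_response <- str_c(
  fence, "json\n",
  "{\"score\": 0.7, \"rationale\": \"Strong earnings.\"}\n",
  fence
)

clean <- raw_response |>
  str_replace_all("\\n", "") |>                               # drop newlines
  str_replace(str_c("^", fence, "json(.*)", fence, "$"), "\\1")  # keep the captured body

clean
```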
Parsing JSON formats
The jsonlite package parses JSON-formatted strings.
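For example, on a hand-written JSON string (illustrative values):

```r
library(jsonlite)

json_text <- '{"score": 0.7, "rationale": "Strong earnings."}'

parsed <- fromJSON(json_text)  # a JSON object becomes a named list

parsed$score      # 0.7
parsed$rationale  # "Strong earnings."
```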
Deploy model on data
Now, we can analyze sentiment of financial news titles.
Step 1: Build a function that
- Reads title and
- Generates LLM results
- Cleans it
get_sentiment <- function(title){
  prompt <- str_glue("Tell me the sentiment of this article: ```{title}```")
  llm_response <- news_analyzer$chat(prompt, echo = "none")
  clean_response <-
    str_replace_all(llm_response, "\\n", "") |>
    str_replace(
      "^```json(.*)```$", # capturing group
      "\\1"
    ) |>
    fromJSON()
  return(clean_response)
}
Step 2: Map the function
Step 3: Tidy the data (unnest_wider())
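Steps 2 and 3 can be sketched without calling an LLM. Here fake_sentiment() is a hypothetical stand-in that returns the same list shape as get_sentiment(), and the titles are made up.

```r
library(dplyr)
library(purrr)
library(tidyr)
library(tibble)
library(stringr)

# Hypothetical stand-in for get_sentiment(): returns list(score, rationale)
fake_sentiment <- function(title) {
  list(
    score = if (str_detect(title, "surge")) 0.8 else 0.3,
    rationale = "stub rationale"
  )
}

news <- tibble(title = c("Stocks surge on earnings", "Markets fall on weak data"))

result <- news |>
  mutate(sentiment = map(title, fake_sentiment)) |>  # Step 2: one list per row
  unnest_wider(sentiment)                            # Step 3: list elements become columns

result
```

Swapping fake_sentiment for get_sentiment gives the real pipeline, at the cost of one API call per title.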
Exercise
Collect and analyze 10 financial news articles for market sentiment.
- Choose a financial topic (e.g., “stock market”, “cryptocurrency”).
- Collect 10 articles using newsanchor with the above topic.
- Replicate class examples to generate sentiment scores and rationales.
Comments
Comments explain the code and make it readable, but they are ignored by the computer.
- Use # to make a comment
- Everything after # on the line is ignored
- Use the Ctrl (Cmd) + / hotkey to toggle comments.